Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Hyukjin Kwon
One very good example is SparkR releases in Conda channel (
https://github.com/conda-forge/r-sparkr-feedstock).
This is fully run by the community unofficially.

On Tue, 19 Mar 2024 at 09:54, Mich Talebzadeh 
wrote:

> +1 for me
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
>
> On Mon, 18 Mar 2024 at 16:23, Parsian, Mahmoud 
> wrote:
>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>>
>>
>>
>>
>> *From: *ashok34...@yahoo.com.INVALID 
>> *Date: *Monday, March 18, 2024 at 6:36 AM
>> *To: *user @spark , Spark dev list <
>> d...@spark.apache.org>, Mich Talebzadeh 
>> *Cc: *Matei Zaharia 
>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>> Apache Spark Community
>>
>> External message, be mindful when clicking links or attachments
>>
>>
>>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>>
>>
>>
>> Some of you may be aware that the Databricks community (Home | Databricks)
>>
>> has just launched a knowledge sharing hub. I thought it would be a
>>
>> good idea for the Apache Spark user group to have the same, especially
>>
>> for repeat questions on Spark Core, Spark SQL, Spark Structured
>>
>> Streaming, Spark MLlib and so forth.
>>
>>
>>
>> Apache Spark user and dev groups have been around for a good while.
>>
>> They are serving their purpose. We went through creating a Slack
>>
>> community that managed to create more heat than light. This is
>>
>> what the Databricks community came up with, and I quote:
>>
>>
>>
>> "Knowledge Sharing Hub
>>
>> Dive into a collaborative space where members like YOU can exchange
>>
>> knowledge, tips, and best practices. Join the conversation today and
>>
>> unlock a wealth of collective wisdom to enhance your experience and
>>
>> drive success."
>>
>>
>>
>> I don't know the logistics of setting it up, but I am sure that should
>>
>> not be that difficult. If anyone is supportive of this proposal, let
>>
>> the usual +1, 0, -1 decide.
>>
>>
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>>
>> Dad | Technologist | Solutions Architect | Engineer
>>
>> London
>>
>> United Kingdom
>>
>>
>>
>>
>>
>>   view my Linkedin profile
>>
>>
>>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>
>>
>>
>>
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>>
>> knowledge but of course cannot be guaranteed. It is essential to note
>>
>> that, as with any advice, quote "one test result is worth one-thousand
>>
>> expert opinions (Werner Von Braun)".
>>
>>
>>
>> -
>>
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Hyukjin Kwon
Is this related to https://github.com/apache/spark/pull/42428?

cc @Yang,Jie(INF) 

On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim 
wrote:

> Shall we revisit this functionality? The API doc is built with individual
> versions, and for each individual version we depend on other released
> versions. This does not seem right to me. Also, the functionality is
> only in the PySpark API doc, which does not seem consistent either.
>
> I don't think this is manageable with the current approach (listing
> versions in a version-dependent doc). Let's say we release 3.4.3 after 3.5.1.
> Should we update the 3.5.1 docs to add 3.4.3 to the version switcher?
> What about when we release a new version after having released
> 10 more versions? What are the criteria for pruning versions?
>
> Unless we have a good answer to these questions, I think it's better to
> revert the functionality - it missed various considerations.
>
> On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
> wrote:
>
>> Thanks for reporting - this is odd - the dropdown did not exist in other
>> recent releases.
>>
>> https://spark.apache.org/docs/3.5.0/api/python/index.html
>> https://spark.apache.org/docs/3.4.2/api/python/index.html
>> https://spark.apache.org/docs/3.3.4/api/python/index.html
>>
>> Looks like the dropdown feature was recently introduced but only partially
>> done. The dropdown itself was added, but how to bump the
>> version was never documented.
>> The contributor proposed a way to update the version "automatically",
>> but the PR wasn't merged. As a result, we have neither an
>> instruction for bumping the version manually nor an automatic bump.
>>
>> * PR for addition of dropdown: https://github.com/apache/spark/pull/42428
>> * PR for automatically bumping version:
>> https://github.com/apache/spark/pull/42881
>>
>> We will probably need to add an instruction in the release process to
>> update the version. (For automatic bumping I don't have a good idea.)
>> I'll look into it. Please expect some delay during the holiday weekend
>> in S. Korea.
>>
>> Thanks again.
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
>> wrote:
>>
>>> BTW, Jungtaek.
>>>
>>> The PySpark documentation seems to show the wrong branch. At this time, `master`.
>>>
>>> https://spark.apache.org/docs/3.5.1/api/python/index.html
>>>
>>> PySpark Overview
>>> 
>>>
>>>Date: Feb 24, 2024 Version: master
>>>
>>> [image: Screenshot 2024-02-29 at 21.12.24.png]
>>>
>>>
>>> Could you do the follow-up, please?
>>>
>>> Thank you in advance.
>>>
>>> Dongjoon.
>>>
>>>
>>> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
>>>
 Excellent work, congratulations!

 On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
 wrote:

> Congratulations!
>
> Bests,
> Dongjoon.
>
> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>
>> Congratulations!
>>
>>
>>
>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>> wrote:
>>
>> Hi everyone,
>>
>> We are happy to announce the availability of Spark 3.5.1!
>>
>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.5 maintenance branch of Spark. We
>> strongly
>> recommend all 3.5 users to upgrade to this stable release.
>>
>> To download Spark 3.5.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>
>> We would like to acknowledge all community members for contributing
>> to this
>> release. This release would not have been possible without you.
>>
>> Jungtaek Lim
>>
>> ps. Yikun is helping us through releasing the official docker image
>> for Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally
>> available.
>>
>>

 --
 John Zhuge

>>>


Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Hyukjin Kwon
Just FYI, a streaming Python data source API is in progress
(https://github.com/apache/spark/pull/44416); we will likely release this in
Spark 4.0.
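
Until that lands, one rough workaround for the REST-API-to-Kafka pipeline
described below is to use the built-in rate source purely as a periodic
trigger and poll the API inside foreachBatch. A sketch (the endpoint, topic
and broker names are made up, and the Kafka writer assumes the
spark-sql-kafka connector is on the classpath):

import json

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rest-to-kafka-sketch").getOrCreate()

def poll_and_publish(batch_df, batch_id):
    # batch_df is only the tick from the rate source; the real payload comes from the API.
    records = requests.get("https://example.com/api/new-data").json()
    if records:
        out = spark.createDataFrame([(json.dumps(r),) for r in records], "value string")
        (out.write
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("topic", "ingest-topic")
            .save())

(spark.readStream.format("rate").option("rowsPerSecond", 1).load()
    .writeStream
    .trigger(processingTime="30 seconds")
    .foreachBatch(poll_and_publish)
    .start()
    .awaitTermination())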

On Thu, Dec 28, 2023 at 4:53 PM Поротиков Станислав Вячеславович
 wrote:

> Yes, it's actual data.
>
>
>
> Best regards,
>
> Stanislav Porotikov
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* Wednesday, December 27, 2023 9:43 PM
> *Cc:* user@spark.apache.org
> *Subject:* Re: Pyspark UDF as a data source for streaming
>
>
>
> Is this generated data actual data, or are you testing the application?
>
>
>
> Sounds like a form of Lambda architecture here with some
> decision/processing not far from the attached diagram
>
>
>
> HTH
>
>
> Mich Talebzadeh,
>
> Dad | Technologist | Solutions Architect | Engineer
>
> London
>
> United Kingdom
>
>
>
>  view my Linkedin profile
> 
>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Wed, 27 Dec 2023 at 13:26, Поротиков Станислав Вячеславович <
> s.poroti...@skbkontur.ru> wrote:
>
> Actually, it's JSON with a specific structure from an API server.
>
> But the task is to constantly check whether new data appears on the API server and
> load it into Kafka.
>
> The full pipeline can be presented like this:
>
> REST API -> Kafka -> some processing -> Kafka/Mongo -> …
>
>
>
> Best regards,
>
> Stanislav Porotikov
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* Wednesday, December 27, 2023 6:17 PM
> *To:* Поротиков Станислав Вячеславович 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Pyspark UDF as a data source for streaming
>
>
>
> Ok so you want to generate some random data and load it into Kafka on a
> regular interval and the rest?
>
>
>
> HTH
>
> Mich Talebzadeh,
>
> Dad | Technologist | Solutions Architect | Engineer
>
> London
>
> United Kingdom
>
>
>
>  view my Linkedin profile
> 
>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Wed, 27 Dec 2023 at 12:16, Поротиков Станислав Вячеславович
>  wrote:
>
> Hello!
>
> Is it possible to write a PySpark UDF that feeds generated data into a streaming dataframe?
>
> I want to get some data from REST API requests in real time and
> save this data to a dataframe.
>
> And then put it into Kafka.
>
> I can't figure out how to create a streaming dataframe from generated data.
>
>
>
> I am new to Spark streaming.
>
> Could you give me some hints?
>
>
>
> Best regards,
>
> Stanislav Porotikov
>
>
>
>


Re: Architecture of Spark Connect

2023-12-14 Thread Hyukjin Kwon
By default for now, yes. One Spark Connect server handles multiple Spark
sessions. To multiplex or run multiple drivers, you need some extra work,
such as a gateway.
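
For reference, a minimal sketch of that default setup (the version and port
below are only examples): one long-running connect server is started on the
driver side, and each client then opens its own remote session against it.

# Server side, started once:
#   ./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0
#
# Client side: every client that connects gets its own Spark session,
# multiplexed onto that single server (and hence a single driver).
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(5).show()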

On Thu, 14 Dec 2023 at 12:03, Kezhi Xiong  wrote:

> Hi,
>
> My understanding is there is only one driver/spark context for all user
> sessions. When you run the bin/start-connect-server script, you are
> submitting one long standing spark job / application. Every time a new user
> request comes in, a new user session is created under that. Please correct
> me if I am wrong.
>
> Kezhi
>
> On Thu, Dec 14, 2023 at 10:35 AM Nikhil Goyal  wrote:
>
>> [ External sender. Exercise caution. ]
>>
>> If multiple applications are running, would we need multiple Spark
>> Connect servers? If so, is the user responsible for creating these servers,
>> or are they just created on the fly when the user requests a new Spark
>> session?
>>
>> On Thu, Dec 14, 2023 at 10:28 AM Nikhil Goyal 
>> wrote:
>>
>>> Hi folks,
>>> I am trying to understand one question. Does Spark Connect create a new
>>> driver in the backend for every user, or is there a fixed number of drivers
>>> running to which requests are sent?
>>>
>>> Thanks
>>> Nikhil
>>>
>>


Re: [FYI] SPARK-45981: Improve Python language test coverage

2023-12-02 Thread Hyukjin Kwon
Awesome!

On Sat, Dec 2, 2023 at 2:33 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community
> starts to have test coverage for all supported Python versions from Today.
>
> - https://github.com/apache/spark/actions/runs/7061665420
>
> Here is a summary.
>
> 1. Main CI: All PRs and commits on `master` branch are tested with Python
> 3.9.
> 2. Daily CI:
> https://github.com/apache/spark/actions/workflows/build_python.yml
> - PyPy 3.8
> - Python 3.10
> - Python 3.11
> - Python 3.12
>
> This is a great addition for PySpark 4.0+ users and an extensible
> framework for all future Python versions.
>
> Thank you all for making this together!
>
> Best,
> Dongjoon.
>


Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Hyukjin Kwon
The demo was really amazing.
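
For anyone curious, a usage sketch based on the README of the repository
linked below (https://github.com/databrickslabs/pyspark-ai/); the exact API
may have changed since, and it assumes an LLM key such as OPENAI_API_KEY is
already configured:

# pip install pyspark-ai
from pyspark.sql import SparkSession
from pyspark_ai import SparkAI

spark = SparkSession.builder.getOrCreate()
spark_ai = SparkAI()
spark_ai.activate()  # adds the .ai accessor to DataFrames

df = spark.createDataFrame(
    [("toy", 10), ("book", 20), ("toy", 5)], ["category", "amount"]
)
# The English instruction below is a made-up example.
df.ai.transform("total amount per category").show()
df.ai.explain()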

On Tue, 4 Jul 2023 at 09:17, Farshid Ashouri 
wrote:

> This is wonderful news!
>
> On Tue, 4 Jul 2023 at 01:14, Gengliang Wang  wrote:
>
>> Dear Apache Spark community,
>>
>> We are delighted to announce the launch of a groundbreaking tool that
>> aims to make Apache Spark more user-friendly and accessible - the
>> English SDK. Powered by
>> the application of Generative AI, the English SDK
>> allows you to execute
>> complex tasks with simple English instructions. This exciting news was
>> announced recently at the Data+AI Summit
>> and also introduced through a detailed blog post.
>>
>> Now, we need your invaluable feedback and contributions. The aim of the
>> English SDK is not only to simplify and enrich your Apache Spark experience
>> but also to grow with the community. We're calling upon Spark developers
>> and users to explore this innovative tool, offer your insights, provide
>> feedback, and contribute to its evolution.
>>
>> You can find more details about the SDK and usage examples on the GitHub
>> repository https://github.com/databrickslabs/pyspark-ai/. If you have
>> any feedback or suggestions, please feel free to open an issue directly on
>> the repository. We are actively monitoring the issues and value your
>> insights.
>>
>> We also welcome pull requests and are eager to see how you might extend
>> or refine this tool. Let's come together to continue making Apache Spark
>> more approachable and user-friendly.
>>
>> Thank you in advance for your attention and involvement. We look forward
>> to hearing your thoughts and seeing your contributions!
>>
>> Best,
>> Gengliang Wang
>>
> --
>
>
> *Farshid Ashouri*,
> Senior Vice President,
> J.P. Morgan & Chase Co.
> +44 7932 650 788
>
>


Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Hyukjin Kwon
Thanks!

On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan 
wrote:

>
> Thanks Dongjoon !
>
> Regards,
> Mridul
>
> On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun  wrote:
>
>> We are happy to announce the availability of Apache Spark 3.4.1!
>>
>> Spark 3.4.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.4 maintenance branch of Spark. We
>> strongly
>> recommend all 3.4 users to upgrade to this stable release.
>>
>> To download Spark 3.4.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-4-1.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>>
>> Dongjoon Hyun
>>
>


Re: Slack for PySpark users

2023-03-27 Thread Hyukjin Kwon
Yeah, actually I think we had better have a Slack channel so we can
easily have discussions with users and developers.

On Tue, 28 Mar 2023 at 03:08, keen  wrote:

> Hi all,
> I really like *Slack *as communication channel for a tech community.
> There is a Slack workspace for *delta lake users* (
> https://go.delta.io/slack) that I enjoy a lot.
> I was wondering if there is something similar for PySpark users.
>
> If not, would there be anything wrong with creating a new Slack workspace
> for PySpark users? (when explicitly mentioning that this is *not*
> officially part of Apache Spark)?
>
> Cheers
> Martin
>


Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Hyukjin Kwon
Thanks, Yuming.

On Wed, 26 Oct 2022 at 16:01, L. C. Hsieh  wrote:

> Thank you for driving the release of Apache Spark 3.3.1, Yuming!
>
> On Tue, Oct 25, 2022 at 11:38 PM Dongjoon Hyun 
> wrote:
> >
> > It's great. Thank you so much, Yuming!
> >
> > Dongjoon
> >
> > On Tue, Oct 25, 2022 at 11:23 PM Yuming Wang  wrote:
> >>
> >> We are happy to announce the availability of Apache Spark 3.3.1!
> >>
> >> Spark 3.3.1 is a maintenance release containing stability fixes. This
> >> release is based on the branch-3.3 maintenance branch of Spark. We
> strongly
> >> recommend all 3.3 users to upgrade to this stable release.
> >>
> >> To download Spark 3.3.1, head over to the download page:
> >> https://spark.apache.org/downloads.html
> >>
> >> To view the release notes:
> >> https://spark.apache.org/releases/spark-release-3-3-1.html
> >>
> >> We would like to acknowledge all community members for contributing to
> this
> >> release. This release would not have been possible without you.
> >>
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [Feature Request] make unix_micros() and unix_millis() available in PySpark (pyspark.sql.functions)

2022-10-17 Thread Hyukjin Kwon
You can work around it by leveraging expr, e.g., expr("unix_micros(col)"),
for now.
We should have the Scala binding first before we add the Python one, FWIW.
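
A short, self-contained sketch of that workaround (PySpark 3.x; the column
name is arbitrary):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1).select(F.current_timestamp().alias("ts"))

# unix_micros/unix_millis exist as SQL functions, so expr() exposes them to Python.
df.select(
    F.expr("unix_micros(ts)").alias("micros"),
    F.expr("unix_millis(ts)").alias("millis"),
).show(truncate=False)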

On Sat, 15 Oct 2022 at 06:19, Martin  wrote:

> Hi everyone,
>
> In *Spark SQL* there are several timestamp related functions
>
>- unix_micros(timestamp)
>Returns the number of microseconds since 1970-01-01 00:00:00 UTC.
>- unix_millis(timestamp)
>Returns the number of milliseconds since 1970-01-01 00:00:00 UTC.
>Truncates higher levels of precision.
>
> See https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html
>
> Currently these are *"missing" in pyspark.sql.functions*.
>
> https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html#datetime-functions
>
> I'd appreciate it if these were also available in PySpark.
>
> Cheers,
> Martin
>


Re: Stickers and Swag

2022-06-14 Thread Hyukjin Kwon
Woohoo

On Tue, 14 Jun 2022 at 15:04, Xiao Li  wrote:

> Hi, all,
>
> The ASF has an official store at RedBubble
>  that Apache Community
> Development (ComDev) runs. If you are interested in buying Spark Swag, 70
> products featuring the Spark logo are available:
> https://www.redbubble.com/shop/ap/113203780
>
> Go Spark!
>
> Xiao
>


Re: Conda Python Env in K8S

2021-12-24 Thread Hyukjin Kwon
Can you share the logs, settings, environment, etc. and file a JIRA? There
are integration test cases for K8S support, and I myself also tested it
before.
It would be helpful if you try what I did at
https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
and see if it works.
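
In short, the approach in that post boils down to something like this sketch
(the archive is assumed to have been produced with `conda pack` beforehand,
and the file/environment names are examples):

import os

from pyspark.sql import SparkSession

# Point the workers at the Python interpreter inside the unpacked archive...
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

# ...and ship the packed conda env via spark.archives (SPARK-33530, Spark 3.1+).
spark = (
    SparkSession.builder
    .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
    .getOrCreate()
)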

On Mon, 6 Dec 2021 at 17:22, Bode, Meikel, NMA-CFD <
meikel.b...@bertelsmann.de> wrote:

> Hi Mich,
>
>
>
> Thanks for your response. Yes, the --py-files option works. I also tested it.
>
> The question is why the --archives option doesn't?
>
>
>
> From Jira I can see that it should be available since 3.1.0:
>
>
>
> https://issues.apache.org/jira/browse/SPARK-33530
>
> https://issues.apache.org/jira/browse/SPARK-33615
>
>
>
> Best,
>
> Meikel
>
>
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* Samstag, 4. Dezember 2021 18:36
> *To:* Bode, Meikel, NMA-CFD 
> *Cc:* dev ; user@spark.apache.org
> *Subject:* Re: Conda Python Env in K8S
>
>
>
>
> Hi Meikel
>
>
>
> In the past I tried with
>
>
>
>--py-files
> hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
>
>--archives
> hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.zip#pyspark_venv \
>
>
>
> which is basically what you are doing. The first line (--py-files) works, but
> the second one fails.
>
>
>
> It tries to unpack them:
>
>
>
> Unpacking an archive hdfs://
> 50.140.197.220:9000/minikube/codes/pyspark_venv.zip#pyspark_venv
> 
>  from
> /tmp/spark-502a5b57-0fe6-45bd-867d-9738e678e9a3/pyspark_venv.zip to
> /opt/spark/work-dir/./pyspark_venv
>
>
>
> But it failed.
>
>
>
> This could be due to creating the virtual environment inside the docker container in
> the work-dir, or sometimes when there is not enough available memory to
> gunzip and untar the file, especially if your executors are built on
> cluster nodes with less memory than the driver node.
>
>
>
> However, the most convenient way to add additional packages to the docker
> image is to add them directly at the time of creating the
> image. So external packages are bundled as a part of my docker image
> because the image is fixed, and if an application requires that set of
> dependencies every time, they are there. Also note that every RUN
> statement creates an intermediate container and hence increases
> build time, so reduce them as follows:
>
> RUN pip install pyyaml numpy cx_Oracle --no-cache-dir
>
> The --no-cache-dir option to pip prevents the downloaded binaries from
> being added to the image, reducing the image size. It is also advisable to
> install all packages in one line, since every RUN statement creates
> an intermediate container and hence increases the build time.
>
> Log in to the docker image and check for Python packages installed
>
> docker run -u 0 -it 
> spark/spark-py:3.1.1-scala_2.12-8-jre-slim-buster_java8PlusPackages bash
>
> root@5bc049af7278:/opt/spark/work-dir# pip list
>
> PackageVersion
>
> -- ---
>
> cx-Oracle  8.3.0
>
> numpy  1.21.4
>
> pip21.3.1
>
> PyYAML 6.0
>
> setuptools 59.4.0
>
> wheel  0.34.2
>
> HTH
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Sat, 4 Dec 2021 at 07:52, Bode, Meikel, NMA-CFD <
> meikel.b...@bertelsmann.de> wrote:
>
> Hi Mich,
>
>
>
> Sure, that's possible. But distributing the complete env would be more
> practical.
>
> A workaround at the moment is that we build different environments,
> store them in a PV, mount it into the pods, and refer from the
> SparkApplication resource to the desired env.
>
>
>
> But actually these options exist and I want to understand what the issue
> is…
>
> Any hints on that?
>
>
>
> Best,
>
> Meikel
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* 

Re: [R] SparkR on conda-forge

2021-12-19 Thread Hyukjin Kwon
Awesome!

On Mon, 20 Dec 2021 at 09:43, yonghua  wrote:

> Nice release. thanks for sharing.
>
> On 2021/12/20 3:55, Maciej wrote:
> > FYI ‒ thanks to good folks from conda-forge we have now these:
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Hyukjin Kwon
To add some more, if the number of rows to collect is large,
DataFrame.collect can be slower because it launches multiple Spark jobs
sequentially.

Given that DataFrame.toPandas does not take a number of rows to collect, it's
debatable whether to apply the same optimization as DataFrame.collect here.

We could have a configuration to enable and disable it, but the implementation
of this in DataFrame.toPandas would be complicated due to existing
optimizations such as Arrow. I haven't taken a deeper look, but my gut
says it's not worthwhile.
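
A tiny illustration of the difference (hypothetical example; the exact plan
depends on the Spark version and whether Arrow is enabled):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000).limit(10)

rows = df.collect()   # collect() can fetch a few partitions first and stop early
pdf = df.toPandas()   # toPandas() has no such incremental optimization

# Compare the physical plans to see where the limit is evaluated.
df.explain()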

On Sat, Nov 13, 2021 at 12:05 PM Hyukjin Kwon  wrote:

> Thanks for pinging me Sean.
>
> Yes, there's an optimization in DataFrame.collect which tries to collect
> the first few partitions and see if enough rows have been found (and repeats).
>
> DataFrame.toPandas does not have such optimization.
>
> I suspect that the shuffle isn't an actual shuffle, but just collects local
> limits from executors onto one executor to calculate the global limit.
>
> On Fri, Nov 12, 2021 at 11:16 PM Sean Owen  wrote:
>
>> Hyukjin can you weigh in?
>> Is this exchange due to something in your operations or clearly unique to
>> the toPandas operation?
>> I didn't think it worked that way, but maybe there is some good reason it
>> does.
>>
>> On Fri, Nov 12, 2021 at 7:34 AM Sergey Ivanychev <
>> sergeyivanyc...@gmail.com> wrote:
>>
>>> Hi Sean,
>>>
>>> According to the plan I’m observing, this is what happens indeed.
>>> There’s exchange operation that sends data to a single partition/task in
>>> toPandas() + PyArrow enabled case.
>>>
>>> On 12 Nov 2021, at 16:31, Sean Owen wrote:
>>>
>>> Yes, none of the responses are addressing your question.
>>> I do not think it's a bug necessarily; do you end up with one partition
>>> in your execution somewhere?
>>>
>>> On Fri, Nov 12, 2021 at 3:38 AM Sergey Ivanychev <
>>> sergeyivanyc...@gmail.com> wrote:
>>>
>>>> Of course if I give 64G of ram to each executor they will work. But
>>>> what’s the point? Collecting results in the driver should cause a high RAM
>>>> usage in the driver and that’s what is happening in collect() case. In the
>>>> case where pyarrow serialization is enabled all the data is being collected
>>>> on a single executor, which is clearly a wrong way to collect the result on
>>>> the driver.
>>>>
>>>> I guess I’ll open an issue about it in Spark Jira. It clearly looks
>>>> like a bug.
>>>>
>>>> On 12 Nov 2021, at 11:59, Mich Talebzadeh wrote:
>>>>
>>>> OK, your findings do not imply those settings are incorrect. Those
>>>> settings will work if you set-up your k8s cluster in peer-to-peer mode with
>>>> equal amounts of RAM for each node which is common practice.
>>>>
>>>> HTH
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, 11 Nov 2021 at 21:39, Sergey Ivanychev <
>>>> sergeyivanyc...@gmail.com> wrote:
>>>>
>>>>> Yes, in fact those are the settings that cause this behaviour. If set
>>>>> to false, everything goes fine since the implementation in spark sources 
>>>>> in
>>>>> this case is
>>>>>
>>>>> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>>>>>
>>>>> Best regards,
>>>>>
>>>>>
>>>>> Sergey Ivanychev
>>>>>
>>>>> On 11 Nov 2021, at 13:58, Mich Talebzadeh wrote:
>>>>>
>>>>> 
>>>>> Have you tried the following settings:
>>>>>
>>>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
>>>>>
>>>>> spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled","true")
>>>>>
>>>

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Hyukjin Kwon
Thanks for pinging me Sean.

Yes, there's an optimization in DataFrame.collect which tries to collect
the first few partitions and see if enough rows have been found (and repeats).

DataFrame.toPandas does not have such optimization.

I suspect that the shuffle isn't an actual shuffle, but just collects local
limits from executors onto one executor to calculate the global limit.
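
If the Arrow path is the one funnelling everything through a single task, one
workaround sketch (mirroring the non-Arrow code path quoted further down in
this thread) is to collect to the driver yourself and assemble the pandas
DataFrame there:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).selectExpr("id", "id * 2 AS doubled")

# Rows land directly on the driver; pandas builds the frame locally.
pdf = pd.DataFrame.from_records(df.collect(), columns=df.columns)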

On Fri, Nov 12, 2021 at 11:16 PM Sean Owen  wrote:

> Hyukjin can you weigh in?
> Is this exchange due to something in your operations or clearly unique to
> the toPandas operation?
> I didn't think it worked that way, but maybe there is some good reason it
> does.
>
> On Fri, Nov 12, 2021 at 7:34 AM Sergey Ivanychev <
> sergeyivanyc...@gmail.com> wrote:
>
>> Hi Sean,
>>
>> According to the plan I’m observing, this is what happens indeed. There’s
>> exchange operation that sends data to a single partition/task in toPandas()
>> + PyArrow enabled case.
>>
>> On 12 Nov 2021, at 16:31, Sean Owen wrote:
>>
>> Yes, none of the responses are addressing your question.
>> I do not think it's a bug necessarily; do you end up with one partition
>> in your execution somewhere?
>>
>> On Fri, Nov 12, 2021 at 3:38 AM Sergey Ivanychev <
>> sergeyivanyc...@gmail.com> wrote:
>>
>>> Of course if I give 64G of ram to each executor they will work. But
>>> what’s the point? Collecting results in the driver should cause a high RAM
>>> usage in the driver and that’s what is happening in collect() case. In the
>>> case where pyarrow serialization is enabled all the data is being collected
>>> on a single executor, which is clearly a wrong way to collect the result on
>>> the driver.
>>>
>>> I guess I’ll open an issue about it in Spark Jira. It clearly looks like
>>> a bug.
>>>
>>> On 12 Nov 2021, at 11:59, Mich Talebzadeh wrote:
>>>
>>> OK, your findings do not imply those settings are incorrect. Those
>>> settings will work if you set-up your k8s cluster in peer-to-peer mode with
>>> equal amounts of RAM for each node which is common practice.
>>>
>>> HTH
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 11 Nov 2021 at 21:39, Sergey Ivanychev <
>>> sergeyivanyc...@gmail.com> wrote:
>>>
 Yes, in fact those are the settings that cause this behaviour. If set
 to false, everything goes fine since the implementation in spark sources in
 this case is

 pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)

 Best regards,


 Sergey Ivanychev

 On 11 Nov 2021, at 13:58, Mich Talebzadeh wrote:

 
 Have you tried the following settings:

 spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

 spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled","true")

 HTH

view my Linkedin profile
 


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Thu, 4 Nov 2021 at 18:06, Mich Talebzadeh 
 wrote:

> Ok so it boils down on how spark does create toPandas() DF under the
> bonnet. How many executors are involved in k8s cluster. In this model 
> spark
> will create executors = no of nodes - 1
>
> On Thu, 4 Nov 2021 at 17:42, Sergey Ivanychev <
> sergeyivanyc...@gmail.com> wrote:
>
>> > Just to confirm with Collect() alone, this is all on the driver?
>>
>> I shared the screenshot with the plan in the first email. In the
>> collect() case the data gets fetched to the driver without problems.
>>
>> Best regards,
>>
>>
>> Sergey Ivanychev
>>
>> On 4 Nov 2021, at 20:37, Mich Talebzadeh wrote:
>>
>> Just to confirm with Collect() alone, this is all on the driver?
>>
>> --
>
>
>view my Linkedin profile
> 
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary 

Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Hyukjin Kwon
awesome!

On Wed, 2 Jun 2021 at 9:59 AM, Dongjoon Hyun wrote:

> We are happy to announce the availability of Spark 3.1.2!
>
> Spark 3.1.2 is a maintenance release containing stability fixes. This
> release is based on the branch-3.1 maintenance branch of Spark. We strongly
> recommend all 3.1 users to upgrade to this stable release.
>
> To download Spark 3.1.2, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-1-2.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Dongjoon Hyun
>


Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Hyukjin Kwon
Thank you so much guys .. it indeed took a long time and it was pretty
tough this time :-).
It was all possible because of your guys' support. I sincerely appreciate
it .

On Thu, 4 Mar 2021 at 2:26 AM, Dongjoon Hyun wrote:

> It took a long time. Thank you, Hyukjin and all!
>
> Bests,
> Dongjoon.
>
> On Wed, Mar 3, 2021 at 3:23 AM Gabor Somogyi 
> wrote:
>
>> Good to hear and great work Hyukjin! 
>>
>> On Wed, 3 Mar 2021, 11:15 Jungtaek Lim, 
>> wrote:
>>
>>> Thanks Hyukjin for driving the huge release, and thanks everyone for
>>> contributing the release!
>>>
>>> On Wed, Mar 3, 2021 at 6:54 PM angers zhu  wrote:
>>>
>>>> Great work, Hyukjin !
>>>>
>>>> Bests,
>>>> Angers
>>>>
>>>> On Wed, Mar 3, 2021 at 5:02 PM, Wenchen Fan wrote:
>>>>
>>>>> Great work and congrats!
>>>>>
>>>>> On Wed, Mar 3, 2021 at 3:51 PM Kent Yao  wrote:
>>>>>
>>>>>> Congrats, all!
>>>>>>
>>>>>> Bests,
>>>>>> *Kent Yao *
>>>>>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>>>>>> *a spark enthusiast*
>>>>>> *kyuubi <https://github.com/yaooqinn/kyuubi>is a
>>>>>> unified multi-tenant JDBC interface for large-scale data processing and
>>>>>> analytics, built on top of Apache Spark <http://spark.apache.org/>.*
>>>>>> *spark-authorizer <https://github.com/yaooqinn/spark-authorizer>A
>>>>>> Spark SQL extension which provides SQL Standard Authorization for 
>>>>>> **Apache
>>>>>> Spark <http://spark.apache.org/>.*
>>>>>> *spark-postgres <https://github.com/yaooqinn/spark-postgres> A
>>>>>> library for reading data from and transferring data to Postgres / 
>>>>>> Greenplum
>>>>>> with Spark SQL and DataFrames, 10~100x faster.*
>>>>>> *spark-func-extras <https://github.com/yaooqinn/spark-func-extras>A
>>>>>> library that brings excellent and useful functions from various modern
>>>>>> database management systems to Apache Spark <http://spark.apache.org/>.*
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 03/3/2021 15:11,Takeshi Yamamuro
>>>>>>  wrote:
>>>>>>
>>>>>> Great work and Congrats, all!
>>>>>>
>>>>>> Bests,
>>>>>> Takeshi
>>>>>>
>>>>>> On Wed, Mar 3, 2021 at 2:18 PM Mridul Muralidharan 
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Thanks Hyukjin and congratulations everyone on the release !
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mridul
>>>>>>>
>>>>>>> On Tue, Mar 2, 2021 at 8:54 PM Yuming Wang  wrote:
>>>>>>>
>>>>>>>> Great work, Hyukjin!
>>>>>>>>
>>>>>>>> On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> We are excited to announce Spark 3.1.1 today.
>>>>>>>>>
>>>>>>>>> Apache Spark 3.1.1 is the second release of the 3.x line. This
>>>>>>>>> release adds
>>>>>>>>> Python type annotations and Python dependency management support
>>>>>>>>> as part of Project Zen.
>>>>>>>>> Other major updates include improved ANSI SQL compliance support,
>>>>>>>>> history server support
>>>>>>>>> in structured streaming, the general availability (GA) of
>>>>>>>>> Kubernetes and node decommissioning
>>>>>>>>> in Kubernetes and Standalone. In addition, this release continues
>>>>>>>>> to focus on usability, stability,
>>>>>>>>> and polish while resolving around 1500 tickets.
>>>>>>>>>
>>>>>>>>> We'd like to thank our contributors and users for their
>>>>>>>>> contributions and early feedback to
>>>>>>>>> this release. This release would not have been possible without
>>>>>>>>> you.
>>>>>>>>>
>>>>>>>>> To download Spark 3.1.1, head over to the download page:
>>>>>>>>> http://spark.apache.org/downloads.html
>>>>>>>>>
>>>>>>>>> To view the release notes:
>>>>>>>>> https://spark.apache.org/releases/spark-release-3-1-1.html
>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> ---
>>>>>> Takeshi Yamamuro
>>>>>>
>>>>>>


[ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-02 Thread Hyukjin Kwon
We are excited to announce Spark 3.1.1 today.

Apache Spark 3.1.1 is the second release of the 3.x line. This release adds
Python type annotations and Python dependency management support as part of
Project Zen.
Other major updates include improved ANSI SQL compliance support, history
server support
in structured streaming, the general availability (GA) of Kubernetes and
node decommissioning
in Kubernetes and Standalone. In addition, this release continues to focus
on usability, stability,
and polish while resolving around 1500 tickets.

We'd like to thank our contributors and users for their contributions and
early feedback to
this release. This release would not have been possible without you.

To download Spark 3.1.1, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-1-1.html


Re: [SparkR] gapply with strings with arrow

2020-10-10 Thread Hyukjin Kwon
If it works without Arrow optimization, it's likely a bug. Please feel free
to file a JIRA for that.

On Wed, 7 Oct 2020, 22:44 Jacek Pliszka,  wrote:

> Hi!
>
> Is there any place I can find information how to use gapply with arrow?
>
> I've tried something very simple
>
> collect(gapply(
>   df,
>   c("ColumnA"),
>   function(key, x){
>   data.frame(out=c("dfs"), stringsAsFactors=FALSE)
>   },
>   "out String"
> ))
>
> But it fails - similar code with integers or double works fine.
>
> [Fetched stdout timeout] Error in readBin(con, raw(),
> as.integer(dataLen), endian = "big") : invalid 'n' argument
>
> java.lang.UnsupportedOperationException at
>
> org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getUTF8String(ArrowColumnVector.java:233)
> at
> org.apache.spark.sql.vectorized.ArrowColumnVector.getUTF8String(ArrowColumnVector.java:109)
> at
> org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
> Source)
>  ...
>
> When I looked at the source code there - it is all stubs.
>
> Is there a proper way to use arrow in gapply in SparkR?
>
> BR,
>
> Jacek
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-03 Thread Hyukjin Kwon
Nice summary. Thanks Dongjoon. One minor correction -> I believe we dropped
R 3.5 and below at branch 2.4 as well.

On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun,  wrote:

> Hi, All.
>
> As of today, master branch (Apache Spark 3.1.0) resolved
> 852+ JIRA issues and 606+ issues are 3.1.0-only patches.
> According to the 3.1.0 release window, branch-3.1 will be
> created on November 1st and enters QA period.
>
> Here are some notable updates I've been monitoring.
>
> *Language*
> 01. SPARK-25075 Support Scala 2.13
>   - Since SPARK-32926, Scala 2.13 build test has
> become a part of GitHub Action jobs.
>   - After SPARK-33044, Scala 2.13 test will be
> a part of Jenkins jobs.
> 02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
> 03. SPARK-32082 Project Zen: Improving Python usability
>   - 7 of 16 issues are resolved.
> 04. SPARK-32073 Drop R < 3.5 support
>   - This is done for Spark 3.0.1 and 3.1.0.
>
> *Dependency*
> 05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
>   - This changes the default dist. for better cloud support
> 06. SPARK-32981 Remove hive-1.2 distribution
> 07. SPARK-20202 Remove references to org.spark-project.hive
>   - This will remove Hive 1.2.1 from source code
> 08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)
>
> *Core*
> 09. SPARK-27495 Support Stage level resource conf and scheduling
>   - 11 of 15 issues are resolved
> 10. SPARK-25299 Use remote storage for persisting shuffle data
>   - 8 of 14 issues are resolved
>
> *Resource Manager*
> 11. SPARK-33005 Kubernetes GA preparation
>   - It is on the way and we are waiting for more feedback.
>
> *SQL*
> 12. SPARK-30648/SPARK-32346 Support filters pushdown
>   to JSON/Avro
> 13. SPARK-32948/SPARK-32958 Add Json expression optimizer
> 14. SPARK-12312 Support JDBC Kerberos w/ keytab
>   - 11 of 17 issues are resolved
> 15. SPARK-27589 DSv2 was mostly completed in 3.0
>   and added more features in 3.1 but still we missed
>   - All built-in DataSource v2 write paths are disabled
> and v1 write is used instead.
>   - Support partition pruning with subqueries
>   - Support bucketing
>
> We still have one month before the feature freeze
> and starting QA. If you are working for 3.1,
> please consider the timeline and share your schedule
> with the Apache Spark community. For the other stuff,
> we can put it into 3.2 release scheduled in June 2021.
>
> Last not but least, I want to emphasize (7) once again.
> We need to remove the forked unofficial Hive eventually.
> Please let us know your reasons if you need to build
> from Apache Spark 3.1 source code for Hive 1.2.
>
> https://github.com/apache/spark/pull/29936
>
> As I wrote in the above PR description, for old releases,
> Apache Spark 2.4(LTS) and 3.0 (~2021.12) will provide
> Hive 1.2-based distribution.
>
> Bests,
> Dongjoon.
>


PySpark documentation main page

2020-08-01 Thread Hyukjin Kwon
Hi all,

I am trying to write up the main page of PySpark documentation at
https://github.com/apache/spark/pull/29320.

While I think the current proposal might be good enough, I would like
to collect more feedback about the contents, structure and image since
this is the entrance page of PySpark documentation.

For example, sharing a reference site is also very welcome. Let me know
if any of you guys have a good idea to share. I plan to leave it open for
some
more days.

PS: thanks @Liang-Chi Hsieh  and @Sean Owen
 for taking a look at it quickly.


Re: [PSA] Python 2, 3.4 and 3.5 are now dropped

2020-07-13 Thread Hyukjin Kwon
cc user mailing list too.

On Tue, 14 Jul 2020 at 11:27 AM, Hyukjin Kwon wrote:

> I am sending another email to make sure dev people know. Python 2, 3.4 and
> 3.5 are now dropped at https://github.com/apache/spark/pull/28957.
>
>
>


Re: Error: Vignette re-building failed. Execution halted

2020-06-24 Thread Hyukjin Kwon
Looks like you haven't installed the 'e1071' package.

On Wed, 24 Jun 2020 at 6:49 PM, Anwar AliKhan wrote:

> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr
> -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
> 
>
>
> Minor error: the SparkR test failed. I don't use R, so it doesn't affect me.
>
> ***installing help indices
> ** building package indices
> ** installing vignettes
> ** testing if installed package can be loaded from temporary location
> ** testing if installed package can be loaded from final location
> ** testing if installed package keeps a record of temporary installation
> path
> * DONE (SparkR)
> ++ cd /opt/spark/R/lib
> ++ jar cfM /opt/spark/R/lib/sparkr.zip SparkR
> ++ popd
> ++ cd /opt/spark/R/..
> ++ pwd
> + SPARK_HOME=/opt/spark
> + . /opt/spark/bin/load-spark-env.sh
> ++ '[' -z /opt/spark ']'
> ++ SPARK_ENV_SH=spark-env.sh
> ++ '[' -z '' ']'
> ++ export SPARK_ENV_LOADED=1
> ++ SPARK_ENV_LOADED=1
> ++ export SPARK_CONF_DIR=/opt/spark/conf
> ++ SPARK_CONF_DIR=/opt/spark/conf
> ++ SPARK_ENV_SH=/opt/spark/conf/spark-env.sh
> ++ [[ -f /opt/spark/conf/spark-env.sh ]]
> ++ set -a
> ++ . /opt/spark/conf/spark-env.sh
> +++ export SPARK_LOCAL_IP=192.168.0.786
> +++ SPARK_LOCAL_IP=192.168.0.786
> ++ set +a
> ++ export SPARK_SCALA_VERSION=2.12
> ++ SPARK_SCALA_VERSION=2.12
> + '[' -f /opt/spark/RELEASE ']'
> + SPARK_JARS_DIR=/opt/spark/assembly/target/scala-2.12/jars
> + '[' -d /opt/spark/assembly/target/scala-2.12/jars ']'
> + SPARK_HOME=/opt/spark
> + /usr/bin/R CMD build /opt/spark/R/pkg
> * checking for file ‘/opt/spark/R/pkg/DESCRIPTION’ ... OK
> * preparing ‘SparkR’:
> * checking DESCRIPTION meta-information ... OK
> * installing the package to build vignettes
> * creating vignettes ... ERROR
> --- re-building ‘sparkr-vignettes.Rmd’ using rmarkdown
>
> Attaching package: 'SparkR'
>
> The following objects are masked from 'package:stats':
>
> cov, filter, lag, na.omit, predict, sd, var, window
>
> The following objects are masked from 'package:base':
>
> as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
> rank, rbind, sample, startsWith, subset, summary, transform, union
>
> Picked up _JAVA_OPTIONS: -XX:-UsePerfData
> Picked up _JAVA_OPTIONS: -XX:-UsePerfData
> 20/06/24 10:23:54 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
> setLogLevel(newLevel).
>
> [Stage 0:>  (0 + 1) / 1]
>
> [Stage 9:=>  (88 + 1) / 100]
>
> [Stage 13:===>  (147 + 1) / 200]
>
>
>
> 20/06/24 10:24:04 WARN Instrumentation: [79237008] regParam is zero, which
> might cause numerical instability and overfitting.
> 20/06/24 10:24:04 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeSystemBLAS
> 20/06/24 10:24:04 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeRefBLAS
> 20/06/24 10:24:04 WARN LAPACK: Failed to load implementation from:
> com.github.fommil.netlib.NativeSystemLAPACK
> 20/06/24 10:24:04 WARN LAPACK: Failed to load implementation from:
> com.github.fommil.netlib.NativeRefLAPACK
> 20/06/24 10:24:09 WARN package: Truncated the string representation of a
> plan since it was too large. This behavior can be adjusted by setting
> 'spark.sql.debug.maxToStringFields'.
>
> [Stage 67:=>  (184 + 1) / 200]
>
>
>
>
> [Stage 69:>  (134 + 1) / 200]

Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Hyukjin Kwon
Yay!

On Fri, 19 Jun 2020 at 4:46 AM, Mridul Muralidharan wrote:

> Great job everyone ! Congratulations :-)
>
> Regards,
> Mridul
>
> On Thu, Jun 18, 2020 at 10:21 AM Reynold Xin  wrote:
>
>> Hi all,
>>
>> Apache Spark 3.0.0 is the first release of the 3.x line. It builds on
>> many of the innovations from Spark 2.x, bringing new ideas as well as
>> continuing long-term projects that have been in development. This release
>> resolves more than 3400 tickets.
>>
>> We'd like to thank our contributors and users for their contributions and
>> early feedback to this release. This release would not have been possible
>> without you.
>>
>> To download Spark 3.0.0, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-0-0.html
>>
>>
>>
>>


Re: [ANNOUNCE] Apache Spark 2.4.6 released

2020-06-10 Thread Hyukjin Kwon
Yay!

On Thu, 11 Jun 2020 at 10:38 AM, Holden Karau wrote:

> We are happy to announce the availability of Spark 2.4.6!
>
> Spark 2.4.6 is a maintenance release containing stability, correctness,
> and security fixes.
> This release is based on the branch-2.4 maintenance branch of Spark. We
> strongly recommend all 2.4 users to upgrade to this stable release.
>
> To download Spark 2.4.6, head over to the download page:
> http://spark.apache.org/downloads.html
> Spark 2.4.6 is also available in Maven Central, PyPI, and CRAN.
>
> Note that you might need to clear your browser cache or
> to use `Private`/`Incognito` mode according to your browsers.
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2.4.6.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>


Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-10 Thread Hyukjin Kwon
Thanks Dongjoon!

On Sun, 9 Feb 2020 at 10:49 AM, Takeshi Yamamuro wrote:

> Happy to hear the release news!
>
> Bests,
> Takeshi
>
> On Sun, Feb 9, 2020 at 10:28 AM Dongjoon Hyun 
> wrote:
>
>> There was a typo in one URL. The correct release note URL is here.
>>
>> https://spark.apache.org/releases/spark-release-2-4-5.html
>>
>>
>>
>> On Sat, Feb 8, 2020 at 5:22 PM Dongjoon Hyun 
>> wrote:
>>
>>> We are happy to announce the availability of Spark 2.4.5!
>>>
>>> Spark 2.4.5 is a maintenance release containing stability fixes. This
>>> release is based on the branch-2.4 maintenance branch of Spark. We
>>> strongly
>>> recommend all 2.4 users to upgrade to this stable release.
>>>
>>> To download Spark 2.4.5, head over to the download page:
>>> http://spark.apache.org/downloads.html
>>>
>>> Note that you might need to clear your browser cache or
>>> to use `Private`/`Incognito` mode according to your browsers.
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-2.4.5.html
>>>
>>> We would like to acknowledge all community members for contributing to
>>> this
>>> release. This release would not have been possible without you.
>>>
>>> Dongjoon Hyun
>>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Hyukjin Kwon
I was randomly googling out of curiosity, and it seems that's indeed the
problem (
https://r.789695.n4.nabble.com/Error-in-rbind-info-getNamespaceInfo-env-quot-S3methods-quot-td4755490.html
).
Yes, it seems we should make sure we build SparkR with an old R version.
Since support for R prior to version 3.4 is deprecated as of Spark
3.0.0, we could use either R 3.4 or match Jenkins's version (R 3.1 IIRC) for
the Spark 3.0 release.

Redirecting to a dev list and Yuming as well for visibility.

On Fri, 27 Dec 2019 at 12:02 PM, Jeff Zhang wrote:

> Yes, I guess so. But R 3.6.2 was just released this month; I think we
> should use an older version to build SparkR.
>
> On Fri, Dec 27, 2019 at 10:43 AM, Felix Cheung wrote:
>
>> Maybe it’s the reverse - the package is built to run in latest but not
>> compatible with slightly older (3.5.2 was Dec 2018)
>>
>> --
>> *From:* Jeff Zhang 
>> *Sent:* Thursday, December 26, 2019 5:36:50 PM
>> *To:* Felix Cheung 
>> *Cc:* user.spark 
>> *Subject:* Re: Fail to use SparkR of 3.0 preview 2
>>
>> I use R 3.5.2
>>
>> On Fri, Dec 27, 2019 at 4:32 AM, Felix Cheung wrote:
>>
>> It looks like a change in the method signature in R base packages.
>>
>> Which version of R are you running on?
>>
>> --
>> *From:* Jeff Zhang 
>> *Sent:* Thursday, December 26, 2019 12:46:12 AM
>> *To:* user.spark 
>> *Subject:* Fail to use SparkR of 3.0 preview 2
>>
>> I tried SparkR of spark 3.0 preview 2, but hit the following issue.
>>
>> Error in rbind(info, getNamespaceInfo(env, "S3methods")) :
>>   number of columns of matrices must match (see arg 2)
>> Error: package or namespace load failed for ‘SparkR’ in rbind(info,
>> getNamespaceInfo(env, "S3methods")):
>>  number of columns of matrices must match (see arg 2)
>> During startup - Warning messages:
>> 1: package ‘SparkR’ was built under R version 3.6.2
>> 2: package ‘SparkR’ in options("defaultPackages") was not found
>>
>> Does anyone know what might be wrong ? Thanks
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: [VOTE] Shall we release ORC 1.4.5rc1?

2019-12-06 Thread Hyukjin Kwon
+1 (as a Spark user)

On Sat, 7 Dec 2019 at 11:06 AM, Dongjoon Hyun wrote:

> +1 for Apache ORC 1.4.5 release.
>
> Thank you for making the release.
>
> I'd like to mention some notable changes here.
> Apache ORC 1.4.5 is not a drop-in replacement for 1.4.4 because of the
> following.
>
>   ORC-498: ReaderImpl and RecordReaderImpl open separate file handles.
>
> Applications should be updated accordingly. Otherwise, file system
> leakages occur.
> For example, Apache Spark 2.3.5-SNAPSHOT is currently using v1.4.4 and
> will not work with v1.4.5.
>
> In short, there is a breaking change between v1.4.4 and v1.4.5 like the
> breaking change between v1.5.5 and 1.5.6.
> For the required change, please refer Owen's Apache Spark upgrade patch.
>
>   [SPARK-28208][BUILD][SQL] Upgrade to ORC 1.5.6 including closing the
> ORC readers
>
> https://github.com/apache/spark/commit/dfb0a8bb048d43f8fd1fb05b1027bd2fc7438dbc
>
> Bests,
> Dongjoon.
>
>
> On Fri, Dec 6, 2019 at 4:19 PM Alan Gates  wrote:
>
>> +1.  Did a build on ubuntu 16, checked the signatures and hashes.
>> Reviewed
>> the license changes.
>>
>> Alan.
>>
>> On Fri, Dec 6, 2019 at 1:41 PM Owen O'Malley 
>> wrote:
>>
>> > All,
>> >Ok, I backported a few more fixes in to rc1:
>> >
>> >- ORC-480
>> >- ORC-552
>> >- ORC-576
>> >
>> >
>> > Should we release the following artifacts as ORC 1.4.5?
>> >
>> > tar: http://home.apache.org/~omalley/orc-1.4.5/
>> > tag: https://github.com/apache/orc/releases/tag/release-1.4.5rc1
>> > jiras: https://issues.apache.org/jira/browse/ORC/fixforversion/12345479
>> >
>> > Thanks!
>> >
>>
>


Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-07 Thread Hyukjin Kwon
+1

On Wed, 6 Nov 2019 at 11:38 PM, Wenchen Fan wrote:

> Sounds reasonable to me. We should make the behavior consistent within
> Spark.
>
> On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler  wrote:
>
>> Currently, when a PySpark Row is created with keyword arguments, the
>> fields are sorted alphabetically. This has created a lot of confusion with
>> users because it is not obvious (although it is stated in the pydocs) that
>> they will be sorted alphabetically. Then later when applying a schema and
>> the field order does not match, an error will occur. Here is a list of some
>> of the JIRAs that I have been tracking all related to this issue:
>> SPARK-24915, SPARK-22232, SPARK-27939, SPARK-27712, and relevant discussion
>> of the issue [1].
>>
>> The original reason for sorting fields is because kwargs in python < 3.6
>> are not guaranteed to be in the same order that they were entered [2].
>> Sorting alphabetically ensures a consistent order. Matters are further
>> complicated by the flag *__from_dict__* that allows the Row fields to
>> be referenced by name when made by kwargs, but this flag is not serialized
>> with the Row and leads to inconsistent behavior. For instance:
>>
>> >>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
>> Row(B='2', A='1')
>> >>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]), "B string, A string").first()
>> Row(B='1', A='2')
>>
>> I think the best way to fix this is to remove the sorting of fields when
>> constructing a Row. For users with Python 3.6+, nothing would change
>> because these versions of Python ensure that the kwargs stays in the
>> ordered entered. For users with Python < 3.6, using kwargs would check a
>> conf to either raise an error or fallback to a LegacyRow that sorts the
>> fields as before. With Python < 3.6 being deprecated now, this LegacyRow
>> can also be removed at the same time. There are also other ways to create
>> Rows that will not be affected. I have opened a JIRA [3] to capture this,
>> but I am wondering what others think about fixing this for Spark 3.0?
>>
>> [1] https://github.com/apache/spark/pull/20280
>> [2] https://www.python.org/dev/peps/pep-0468/
>> [3] https://issues.apache.org/jira/browse/SPARK-29748
>>
>>
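
A small illustration of the inconsistency described above, plus two ways to
sidestep it that do not depend on kwargs ordering (hypothetical snippet; the
sorting behaviour applies to the pre-Python-3.6 code path):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Row(A=..., B=...) sorts its fields alphabetically on old Python versions,
# so relying on kwargs order is fragile. Two order-preserving alternatives:

# 1) Plain tuples plus an explicit schema.
spark.createDataFrame([("1", "2")], "A string, B string").show()

# 2) A Row "class" with a fixed field order, then positional values.
MyRow = Row("A", "B")
spark.createDataFrame([MyRow("1", "2")], "A string, B string").show()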


Re: DataSourceV2: pushFilters() is not invoked for each read call - spark 2.3.2

2019-09-06 Thread Hyukjin Kwon
I believe this issue was fixed in Spark 2.4.

Spark DataSource V2 is still being radically developed - it is not
complete yet.
So, I think the feasible options to get through this at the moment are:
  1. upgrade to a higher Spark version
  2. disable filter pushdown in your DataSource V2 implementation

I don't think the Spark community will backport or fix things in branch-2.3,
which will reach EOL soon.
For each branch, DataSource V2 has totally different code.
Fixing it specifically in each branch would bring considerable overhead.
I believe that's usually the case for internal Spark forks as
well.



On Fri, 6 Sep 2019 at 3:25 PM, Shubham Chaurasia wrote:

> Hi,
>
> I am using spark v2.3.2. I have an implementation of DSV2. Here is what is
> happening:
>
> 1) Obtained a dataframe using MyDataSource
>
> scala> val df1 = spark.read.format("com.shubham.MyDataSource").load
>> MyDataSource.MyDataSource
>> MyDataSource.createReader: Going to create a new MyDataSourceReader
>> MyDataSourceReader.MyDataSourceReader:
>> Instantiatedcom.shubham.reader.MyDataSourceReader@2b85edc7
>> MyDataSourceReader.readSchema:
>> com.shubham.reader.MyDataSourceReader@2b85edc7 baseSchema:
>> StructType(StructField(c1,IntegerType,true),
>> StructField(c2,IntegerType,true), StructField(c3,IntegerType,true))
>> df1: org.apache.spark.sql.DataFrame = [c1: int, c2: int ... 1 more field]
>>
>
> 2) show() on df1
>
>> scala> df1.show
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pruneColumns:
>> StructType(StructField(c1,IntegerType,true),
>> StructField(c2,IntegerType,true), StructField(c3,IntegerType,true))
>> MyDataSourceReader.readSchema:
>> com.shubham.reader.MyDataSourceReader@2b85edc7 baseSchema:
>> StructType(StructField(c1,IntegerType,true),
>> StructField(c2,IntegerType,true), StructField(c3,IntegerType,true))
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> ===MyDataSourceReader.createBatchDataReaderFactories===
>> prunedSchema = StructType(StructField(c1,IntegerType,true),
>> StructField(c2,IntegerType,true), StructField(c3,IntegerType,true))
>> pushedFilters = []
>> ===MyDataSourceReader.createBatchDataReaderFactories===
>> +---+---+---+
>> | c1| c2| c3|
>> +---+---+---+
>> +---+---+---+
>>
>
> 3) val df2 = df1.filter($"c3" > 1)
>
>>
>> scala> df2.show
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushedFilters: []
>> MyDataSourceReader.pushFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pruneColumns:
>> StructType(StructField(c1,IntegerType,true),
>> StructField(c2,IntegerType,true), StructField(c3,IntegerType,true))
>> MyDataSourceReader.readSchema:
>> com.shubham.reader.MyDataSourceReader@2b85edc7 baseSchema:
>> StructType(StructField(c1,IntegerType,true),
>> StructField(c2,IntegerType,true), StructField(c3,IntegerType,true))
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> ===MyDataSourceReader.createBatchDataReaderFactories===
>> prunedSchema = StructType(StructField(c1,IntegerType,true),
>> StructField(c2,IntegerType,true), StructField(c3,IntegerType,true))
>> pushedFilters = [IsNotNull(c3), GreaterThan(c3,1)]
>> ===MyDataSourceReader.createBatchDataReaderFactories===
>> +---+---+---+
>> | c1| c2| c3|
>> +---+---+---+
>> +---+---+---+
>
>
> 4) Again df1.show() <=== As df2 is derived from df1 (and shares the same
> instance of MyDataSourceReader), this modifies pushedFilters even for df1
>
>> scala> df1.show
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> MyDataSourceReader.pushedFilters: [IsNotNull(c3), GreaterThan(c3,1)]
>> 

Re: [ANNOUNCE] Announcing Apache Spark 2.4.4

2019-09-01 Thread Hyukjin Kwon
YaY!

On Mon, 2 Sep 2019 at 1:27 PM, Wenchen Fan wrote:

> Great! Thanks!
>
> On Mon, Sep 2, 2019 at 5:55 AM Dongjoon Hyun 
> wrote:
>
>> We are happy to announce the availability of Spark 2.4.4!
>>
>> Spark 2.4.4 is a maintenance release containing stability fixes. This
>> release is based on the branch-2.4 maintenance branch of Spark. We
>> strongly
>> recommend all 2.4 users to upgrade to this stable release.
>>
>> To download Spark 2.4.4, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> Note that you might need to clear your browser cache or
>> to use `Private`/`Incognito` mode according to your browsers.
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-2-4-4.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Dongjoon Hyun
>>
>


Re: Release Apache Spark 2.4.4

2019-08-14 Thread Hyukjin Kwon
Adding Shixiong

WDYT?

On Wed, 14 Aug 2019 at 2:30 PM, Terry Kim wrote:

> Can the following be included?
>
> [SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in
> EpochTracker (to support Python UDFs)
> 
>
> Thanks,
> Terry
>
> On Tue, Aug 13, 2019 at 10:24 PM Wenchen Fan  wrote:
>
>> +1
>>
>> On Wed, Aug 14, 2019 at 12:52 PM Holden Karau 
>> wrote:
>>
>>> +1
>>> Does anyone have any critical fixes they’d like to see in 2.4.4?
>>>
>>> On Tue, Aug 13, 2019 at 5:22 PM Sean Owen  wrote:
>>>
 Seems fine to me if there are enough valuable fixes to justify another
 release. If there are any other important fixes imminent, it's fine to
 wait for those.


 On Tue, Aug 13, 2019 at 6:16 PM Dongjoon Hyun 
 wrote:
 >
 > Hi, All.
 >
 > Spark 2.4.3 was released three months ago (8th May).
 > As of today (13th August), there are 112 commits (75 JIRAs) in
 `branch-24` since 2.4.3.
 >
 > It would be great if we can have Spark 2.4.4.
 > Shall we start `2.4.4 RC1` next Monday (19th August)?
 >
 > Last time, there was a request for K8s issue and now I'm waiting for
 SPARK-27900.
 > Please let me know if there is another issue.
 >
 > Thanks,
 > Dongjoon.

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

 --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>


Re: Continuous processing mode and python udf

2019-08-13 Thread Hyukjin Kwon
that's fixed in
https://github.com/apache/spark/commit/b83b7927b3a85c1a4945e2224ed811b5bb804477

On Tue, 13 Aug 2019 at 12:37 PM, zenglong chen wrote:

> Does Spark 2.4.0 support Python UDFs with Continuous Processing mode?
> I tried it and got an error like the one below:
>  WARN scheduler.TaskSetManager: Lost task 4.0 in stage 0.0 (TID 4,
> 172.22.9.179, executor 1): java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:347)
> at scala.None$.get(Option.scala:345)
> at
> org.apache.spark.sql.execution.streaming.continuous.ContinuousQueuedDataReader.next(ContinuousQueuedDataReader.scala:116)
> at
> org.apache.spark.sql.execution.streaming.continuous.ContinuousDataSourceRDD$$anon$1.getNext(ContinuousDataSourceRDD.scala:83)
> at
> org.apache.spark.sql.execution.streaming.continuous.ContinuousDataSourceRDD$$anon$1.getNext(ContinuousDataSourceRDD.scala:81)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
> Source)
> at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
> at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$class.foreach(Iterator.scala:891)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
> at
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
> at
> org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:50)
> at
> org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
> at
> org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
>
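
For reference, a minimal sketch of the setup being discussed, on a version that
includes the fix above (the rate source, rows-per-second value, and trigger
interval are illustrative assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.appName("continuous-python-udf").getOrCreate()

@udf("long")
def plus_one(v):
    # plain Python UDF applied inside a continuous-processing query
    return v + 1 if v is not None else None

stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

query = (stream
         .select(plus_one("value").alias("value_plus_one"))
         .writeStream
         .format("console")
         .trigger(continuous="1 second")  # continuous processing mode
         .start())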


Re: Release Apache Spark 2.4.4

2019-08-13 Thread Hyukjin Kwon
+1

On Wed, 14 Aug 2019 at 9:13 AM, Takeshi Yamamuro wrote:

> Hi,
>
> Thanks for your notification, Dongjoon!
> I put some links for the other committers/PMCs to access the info easily:
>
> A commit list in github from the last release:
> https://github.com/apache/spark/compare/5ac2014e6c118fbeb1fe8e5c8064c4a8ee9d182a...branch-2.4
> A issue list in jira:
> https://issues.apache.org/jira/projects/SPARK/versions/12345466#release-report-tab-body
> The 5 correctness issues resolved in branch-2.4:
>
> https://issues.apache.org/jira/browse/SPARK-27798?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012345466%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
>
> Anyway, +1
>
> Best,
> Takeshi
>
> On Wed, Aug 14, 2019 at 8:25 AM DB Tsai  wrote:
>
>> +1
>>
>> On Tue, Aug 13, 2019 at 4:16 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > Spark 2.4.3 was released three months ago (8th May).
>> > As of today (13th August), there are 112 commits (75 JIRAs) in
>> `branch-24` since 2.4.3.
>> >
>> > It would be great if we can have Spark 2.4.4.
>> > Shall we start `2.4.4 RC1` next Monday (19th August)?
>> >
>> > Last time, there was a request for K8s issue and now I'm waiting for
>> SPARK-27900.
>> > Please let me know if there is another issue.
>> >
>> > Thanks,
>> > Dongjoon.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Usage of PyArrow in Spark

2019-07-17 Thread Hyukjin Kwon
Regular Python UDFs don't use PyArrow under the hood.
Yes, they can potentially benefit but they can be easily worked around via
Pandas UDFs.

For instance, both below are virtually identical.

@udf(...)
def func(col):
    return col

@pandas_udf(...)
def pandas_func(col):
    return col.apply(lambda v: v)

If it only needs a minimal change, I would be positive about adding
Arrow support to regular Python UDFs. Otherwise, I am not sure yet.
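
For reference, a self-contained sketch of the comparison above (assuming pyarrow
is installed; the UDF names and column are illustrative). Only the pandas_udf
exchanges data with Python through Arrow record batches:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)

@udf("long")
def plain_identity(v):
    # called once per row with a plain Python value (pickle-based serde)
    return v

@pandas_udf("long")
def arrow_identity(s):
    # called with a pandas.Series per batch, transferred via Arrow
    return s

df.select(plain_identity("id"), arrow_identity("id")).show()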


On Wed, 17 Jul 2019 at 1:19 PM, Abdeali Kothari wrote:

> Hi,
> In spark 2.3+ I saw that pyarrow was being used in a bunch of places in
> spark. And I was trying to understand the benefit in terms of serialization
> / deserialization it provides.
>
> I understand that the new pandas-udf works only if pyarrow is installed.
> But what about the plain old PythonUDF which can be used in map() kind of
> operations?
> Are they also using pyarrow under the hood to reduce the cost of serde? Or
> do they remain as earlier and no performance gain should be expected in
> those?
>
> If I'm not mistaken, plain old PythonUDFs could also benefit from Arrow as
> the data transfer cost to serialize/deserialize from Java to Python and
> back still exists and could potentially be reduced by using Arrow?
> Is my understanding correct? Are there any plans to implement this?
>
> Pointers to any notes or Jira about this would be appreciated.
>


Re: Exposing JIRA issue types at GitHub PRs

2019-06-16 Thread Hyukjin Kwon
Labels look good and useful.

On Sat, 15 Jun 2019, 02:36 Dongjoon Hyun,  wrote:

> Now, you can see the exposed component labels (ordered by the number of
> PRs) here and click the component to search.
>
> https://github.com/apache/spark/labels?sort=count-desc
>
> Dongjoon.
>
>
> On Fri, Jun 14, 2019 at 1:15 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> JIRA and PR is ready for reviews.
>>
>> https://issues.apache.org/jira/browse/SPARK-28051 (Exposing JIRA issue
>> component types at GitHub PRs)
>> https://github.com/apache/spark/pull/24871
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Thu, Jun 13, 2019 at 10:48 AM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for the feedbacks and requirements, Hyukjin, Reynold, Marco.
>>>
>>> Sure, we can do whatever we want.
>>>
>>> I'll wait for more feedbacks and proceed to the next steps.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Jun 12, 2019 at 11:51 PM Marco Gaido 
>>> wrote:
>>>
 Hi Dongjoon,
 Thanks for the proposal! I like the idea. Maybe we can extend it to
 components too, and to some JIRA labels such as correctness, which may be
 worth highlighting in PRs too. My only concern is that in many cases JIRAs
 are created not very carefully, so they may be incorrect at the moment of
 the PR's creation and may be updated later: keeping them in sync may be
 an extra effort.

 On Thu, 13 Jun 2019, 08:09 Reynold Xin,  wrote:

> Seems like a good idea. Can we test this with a component first?
>
> On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Since we use both Apache JIRA and GitHub actively for Apache Spark
>> contributions, we have lots of JIRAs and PRs consequently. One specific
>> thing I've been longing to see is `Jira Issue Type` in GitHub.
>>
>> How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`?
>> There are two main benefits:
>> 1. It helps the communication between the contributors and reviewers
>> with more information.
>> (In some cases, some people only visit GitHub to see the PR and
>> commits)
>> 2. `Labels` is searchable. We don't need to visit Apache Jira to
>> search PRs to see a specific type.
>> (For example, the reviewers can see and review 'BUG' PRs first by
>> using `is:open is:pr label:BUG`.)
>>
>> Of course, this can be done automatically without human intervention.
>> Since we already have GitHub Jenkins job to access JIRA/GitHub, that job
>> can add the labels from the beginning. If needed, I can volunteer to 
>> update
>> the script.
>>
>> To show the demo, I labeled several PRs manually. You can see the
>> result right now in Apache Spark PR page.
>>
>>   - https://github.com/apache/spark/pulls
>>
>> If you're surprised due to those manual activities, I want to
>> apologize for that. I hope we can take advantage of the existing GitHub
>> features to serve Apache Spark community in a way better than yesterday.
>>
>> How do you think about this specific suggestion?
>>
>> Bests,
>> Dongjoon
>>
>> PS. I saw that `Request Review` and `Assign` features are already
>> used for some purposes, but these feature are out of the scope in this
>> email.
>>
>


Re: Exposing JIRA issue types at GitHub PRs

2019-06-13 Thread Hyukjin Kwon
Yea, I think we can automate this process via, for instance,
https://github.com/apache/spark/blob/master/dev/github_jira_sync.py

+1 for this sort of automatic categorizing and matching of metadata between
JIRA and github

Adding Josh and Sean as well.

On Thu, 13 Jun 2019, 13:17 Dongjoon Hyun,  wrote:

> Hi, All.
>
> Since we use both Apache JIRA and GitHub actively for Apache Spark
> contributions, we have lots of JIRAs and PRs consequently. One specific
> thing I've been longing to see is `Jira Issue Type` in GitHub.
>
> How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`?
> There are two main benefits:
> 1. It helps the communication between the contributors and reviewers with
> more information.
> (In some cases, some people only visit GitHub to see the PR and
> commits)
> 2. `Labels` is searchable. We don't need to visit Apache Jira to search
> PRs to see a specific type.
> (For example, the reviewers can see and review 'BUG' PRs first by
> using `is:open is:pr label:BUG`.)
>
> Of course, this can be done automatically without human intervention.
> Since we already have GitHub Jenkins job to access JIRA/GitHub, that job
> can add the labels from the beginning. If needed, I can volunteer to update
> the script.
>
> To show the demo, I labeled several PRs manually. You can see the result
> right now in Apache Spark PR page.
>
>   - https://github.com/apache/spark/pulls
>
> If you're surprised due to those manual activities, I want to apologize
> for that. I hope we can take advantage of the existing GitHub features to
> serve Apache Spark community in a way better than yesterday.
>
> How do you think about this specific suggestion?
>
> Bests,
> Dongjoon
>
> PS. I saw that `Request Review` and `Assign` features are already used for
> some purposes, but these feature are out of the scope in this email.
>


Re: [ANNOUNCE] Announcing Apache Spark 2.3.3

2019-02-18 Thread Hyukjin Kwon
Yay! Good job Takeshi!

On Mon, 18 Feb 2019, 14:47 Takeshi Yamamuro wrote:

> We are happy to announce the availability of Spark 2.3.3!
>
> Apache Spark 2.3.3 is a maintenance release, based on the branch-2.3
> maintenance branch of Spark. We strongly recommend all 2.3.x users to
> upgrade to this stable release.
>
> To download Spark 2.3.3, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-3-3.html
>
> We would like to acknowledge all community members for contributing to
> this release. This release would not have been possible without you.
>
> Best,
> Takeshi
>
> --
> ---
> Takeshi Yamamuro
>


Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-16 Thread Hyukjin Kwon
Nice!

On Wed, 16 Jan 2019 at 11:55 AM, Jiaan Geng wrote:

> Glad to hear this.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Python

2019-01-02 Thread Hyukjin Kwon
Yup, it's supported.

On Wed, 2 Jan 2019, 3:35 pm Gourav Sengupta wrote:

> Hi,
> Can I please confirm which version of Python 3.x is supported by Spark 2.4?
>
> Regards,
> Gourav
>


Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Hyukjin Kwon
I took a look at the code.

val source = classOf[MyDataSource].getCanonicalName
spark.read.format(source).load().collect()

Indeed, it looks like it is called twice.

First call: it looks like it creates one first to read the schema for the logical plan

test.org.apache.spark.sql.sources.v2.MyDataSourceReader.(MyDataSourceReader.java:36)
test.org.apache.spark.sql.sources.v2.MyDataSource.createReader(MyDataSource.java:35)
org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$SourceHelpers.createReader(DataSourceV2Relation.scala:155)
org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:172)
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:204)
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)

Second call: it creates another one for the actual partitions in a physical plan

test.org.apache.spark.sql.sources.v2.MyDataSourceReader.(MyDataSourceReader.java:36)
test.org.apache.spark.sql.sources.v2.MyDataSource.createReader(MyDataSource.java:35)
org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$SourceHelpers.createReader(DataSourceV2Relation.scala:155)
org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation.newReader(DataSourceV2Relation.scala:61)
org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$.apply(DataSourceV2Strategy.scala:103)
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
scala.collection.Iterator$class.foreach(Iterator.scala:891)
scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
org.apache.spark.sql.Dataset.withAction(Dataset.scala:3360)
org.apache.spark.sql.Dataset.collect(Dataset.scala:2783)


Skimming the API doc for DataSourceReader at branch-2.4, I haven’t found a
guarantee that the readers are created only once. If that’s documented
somewhere, we should fix it in 2.4.0. If not, I think it is fine since both
calls are on the driver side and it’s something you can work around, for
instance with a static class or a thread local in this case.

Forwarding to the dev mailing list in case this is something we haven't
foreseen.

On Tue, 9 Oct 2018 at 9:39 PM, Shubham Chaurasia wrote:

> Alright, so it is a big project which uses a SQL store underneath.
> I extracted out the minimal code and made a smaller project out of it and
> still it is creating multiple instances.
>
> Here is my project:
>
> ├── my-datasource.iml
> ├── pom.xml
> ├── src
> │   ├── main
> │   │   ├── java
> │   │   │   └── com
> │   │   │   └── shubham
> │   │   │   ├── MyDataSource.java
> │   │   │   └── reader
> │   │   │   └── MyDataSourceReader.java
>
>
> MyDataSource.java
> -
>
> package com.shubham;
>
> import com.shubham.reader.MyDataSourceReader;
> import org.apache.spark.sql.SaveMode;
> import org.apache.spark.sql.sources.v2.DataSourceOptions;
> import org.apache.spark.sql.sources.v2.DataSourceV2;
> import org.apache.spark.sql.sources.v2.ReadSupport;
> import org.apache.spark.sql.sources.v2.WriteSupport;
> import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
> import org.apache.spark.sql.sources.v2.writer.DataSourceWriter;
> import org.apache.spark.sql.types.StructType;
>
> import java.util.Optional;
>
> public class MyDataSource implements DataSourceV2, ReadSupport, WriteSupport {
>
>   

Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Hyukjin Kwon
I think we can deprecate it in 3.x.0 and remove it in Spark 4.0.0. Many
people still use Python 2. Also, technically 2.7 support is not officially
dropped yet - https://pythonclock.org/


On Mon, 17 Sep 2018 at 9:31 AM, Aakash Basu wrote:

> Removing support for an API in a major release makes poor sense;
> deprecating is always better. Removal can always be done two to three minor
> releases later.
>
> On Mon 17 Sep, 2018, 6:49 AM Felix Cheung, 
> wrote:
>
>> I don’t think we should remove any API even in a major release without
>> deprecating it first...
>>
>>
>> --
>> *From:* Mark Hamstra 
>> *Sent:* Sunday, September 16, 2018 12:26 PM
>> *To:* Erik Erlandson
>> *Cc:* user@spark.apache.org; dev
>> *Subject:* Re: Should python-2 be supported in Spark 3.0?
>>
>> We could also deprecate Py2 already in the 2.4.0 release.
>>
>> On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
>> wrote:
>>
>>> In case this didn't make it onto this thread:
>>>
>>> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
>>> remove it entirely on a later 3.x release.
>>>
>>> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
>>> wrote:
>>>
 On a separate dev@spark thread, I raised a question of whether or not
 to support python 2 in Apache Spark, going forward into Spark 3.0.

 Python-2 is going EOL  at
 the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
 make breaking changes to Spark's APIs, and so it is a good time to consider
 support for Python-2 on PySpark.

 Key advantages to dropping Python 2 are:

- Support for PySpark becomes significantly easier.
- Avoid having to support Python 2 until Spark 4.0, which is likely
to imply supporting Python 2 for some time after it goes EOL.

 (Note that supporting python 2 after EOL means, among other things,
 that PySpark would be supporting a version of python that was no longer
 receiving security patches)

 The main disadvantage is that PySpark users who have legacy python-2
 code would have to migrate their code to python 3 to take advantage of
 Spark 3.0

 This decision obviously has large implications for the Apache Spark
 community and we want to solicit community feedback.


>>>


Re: How to make pyspark use custom python?

2018-09-06 Thread Hyukjin Kwon
Are you doubly sure it is an issue in Spark? I have used a custom Python
several times before by setting it in PYSPARK_PYTHON, and it was no
problem.
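
For reference, a minimal sketch of one way to pin the interpreter from the driver
side (the miniconda path is taken from the setup below and must exist on every
node; the sanity check is illustrative):

import os
# must be set before the SparkSession/SparkContext is created
os.environ["PYSPARK_PYTHON"] = "/usr/local/miniconda3/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/miniconda3/bin/python"

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("custom-python-check").getOrCreate()

# sanity check: ask the executors which interpreter they actually run
print(spark.sparkContext.parallelize(range(2), 2)
      .map(lambda _: __import__("sys").executable).collect())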

On Thu, 6 Sep 2018 at 2:21 PM, mithril wrote:

> For better formatting, please see
>
> https://stackoverflow.com/questions/52178406/howto-make-pyspark-use-custom-python
>
>
> --
>
>
> I am using zeppelin to connect to a remote spark cluster.
>
> The remote spark cluster is using system python 2.7.
>
> I want to switch to miniconda3 and install the lib pyarrow.
> What I did is:
>
> 1. Download miniconda3, install some libs, scp miniconda3 folder to spark
> master and slaves.
> 2. adding `PYSPARK_PYTHON="/usr/local/miniconda3/bin/python"` to
> `spark-env.sh` in spark master and slaves.
> 3. restart spark and zeppelin
> 4. Running code
>
> %spark.pyspark
>
> import pandas as pd
> from pyspark.sql.functions import pandas_udf,PandasUDFType
>
>
> @pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
> def process_order_items(pdf):
>
> pdf.loc[:, 'total_price'] = pdf['price'] * pdf['count']
>
> d = {'has_discount':'count',
> 'clearance':'count',
> 'count': ['count', 'sum'],
> 'price_guide':'max',
> 'total_price': 'sum'
>
> }
>
> pdf1 = pdf.groupby('day').agg(d)
> pdf1.columns = pdf1.columns.map('_'.join)
> d1 = {'has_discount_count':'discount_order_count',
> 'clearance_count':'clearance_order_count',
> 'count_count':'order_count',
> 'count_sum':'sale_count',
> 'price_guide_max':'price_guide',
> 'total_price_sum': 'total_price'
> }
>
> pdf2 = pdf1.rename(columns=d1)
>
> pdf2.loc[:, 'discount_sale_count'] =
> pdf.loc[pdf.has_discount>0,
> 'count'].resample(freq).sum()
> pdf2.loc[:, 'clearance_sale_count'] = pdf.loc[pdf.clearance>0,
> 'count'].resample(freq).sum()
> pdf2.loc[:, 'price'] = pdf2.total_price / pdf2.sale_count
>
> pdf2 = pdf2.drop(pdf2[pdf2.order_count == 0].index)
>
> return pdf2
>
>
> results = df.groupby("store_id",
> "product_id").apply(process_order_items)
>
> results.select(['store_id', 'price']).show(5)
>
>
> Got error :
>
> Py4JJavaError: An error occurred while calling o172.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in
> stage 6.0 (TID 143, 10.104.33.18, executor 2):
> org.apache.spark.api.python.PythonException: Traceback (most recent call
> last):
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py",
> line
> 230, in main
> process()
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py",
> line
> 225, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py",
> line
> 150, in 
> func = lambda _, it: map(mapper, it)
>   File
> "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py",
> line 276, in load_stream
> import pyarrow as pa
> ImportError: No module named pyarrow
>
>
> `10.104.33.18` is spark master,  so I think the `PYSPARK_PYTHON` is not set
> correctly .
>
> `pyspark`
>
> I logged in to the master and slaves, ran the `pyspark` interpreter in each, and
> found `import pyarrow` does not throw an exception.
>
>
> PS: `pyarrow` also installed in the machine which running zeppelin.
>
> --
>
> More info:
>
>
> 1. spark cluster is installed in A, B, C; zeppelin is installed in D.
> 2. `PYSPARK_PYTHON` is set in `spark-env.sh` on each of A, B, C
> 3. `import pyarrow` is fine with `/usr/local/spark/bin/pyspark` on A, B, C
> 4. `import pyarrow` is fine on A, B, C custom python (miniconda3)
> 5. `import pyarrow` is fine on D's default python (miniconda3, path is
> different from A, B, C, but that doesn't matter)
>
>
>
> So I completely couldn't understand why it doesn't work.
>
>
>
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Issue upgrading to Spark 2.3.1 (Maintenance Release)

2018-06-15 Thread Hyukjin Kwon
I use PyCharm. Mind if I ask you to elaborate what you did, step by step?

On Sat, 16 Jun 2018 at 12:11 AM, Marcelo Vanzin wrote:

> I'm not familiar with PyCharm. But if you can run "pyspark" from the
> command line and not hit this, then this might be an issue with
> PyCharm or your environment - e.g. having an old version of the
> pyspark code around, or maybe PyCharm itself might need to be updated.
>
> On Thu, Jun 14, 2018 at 10:01 PM, Aakash Basu
>  wrote:
> > Hi,
> >
> > Downloaded the latest Spark version because of the fix for "ERROR
> > AsyncEventQueue:70 - Dropping event from queue appStatus."
> >
> > After setting environment variables and running the same code in PyCharm,
> > I'm getting this error, which I can't find a solution for.
> >
> > Exception in thread "main" java.util.NoSuchElementException: key not
> found:
> > _PYSPARK_DRIVER_CONN_INFO_PATH
> > at scala.collection.MapLike$class.default(MapLike.scala:228)
> > at scala.collection.AbstractMap.default(Map.scala:59)
> > at scala.collection.MapLike$class.apply(MapLike.scala:141)
> > at scala.collection.AbstractMap.apply(Map.scala:59)
> > at
> >
> org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:64)
> > at
> >
> org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:498)
> > at
> >
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> > at
> >
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
> > at
> > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
> > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
> > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
> > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> >
> >
> > Any help?
> >
> > Thanks,
> > Aakash.
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

2018-04-05 Thread Hyukjin Kwon
FYI, there is a PR and JIRA for virtualEnv support in PySpark

https://issues.apache.org/jira/browse/SPARK-13587
https://github.com/apache/spark/pull/13599


2018-04-06 7:48 GMT+08:00 Andy Davidson :

> FYI
>
> http://www.learn4master.com/algorithms/pyspark-unit-test-
> set-up-sparkcontext
>
> From: Andrew Davidson 
> Date: Wednesday, April 4, 2018 at 5:36 PM
> To: "user @spark" 
> Subject: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError:
> yield from walk(
>
> I am having a heck of a time setting up my development environment. I used
> pip to install pyspark. I also downloaded spark from apache.
>
> My eclipse pyDev interpreter is configured as a python3 virtualenv
>
> I have a simple unit test that loads a small dataframe. Df.show()
> generates the following error
>
>
> 2018-04-04 17:13:56 ERROR Executor:91 - Exception in task 0.0 in stage 0.0
> (TID 0)
>
> org.apache.spark.SparkException:
>
> Error from python worker:
>
>   Traceback (most recent call last):
>
> File "/Users/a/workSpace/pythonEnv/spark-2.3.0/lib/python3.6/site.py",
> line 67, in 
>
>   import os
>
> File "/Users/a/workSpace/pythonEnv/spark-2.3.0/lib/python3.6/os.py",
> line 409
>
>   yield from walk(new_path, topdown, onerror, followlinks)
>
>^
>
>   SyntaxError: invalid syntax
>
>
>
> My unittest classs is dervied from.
>
>
> class PySparkTestCase(unittest.TestCase):
>
>
> @classmethod
>
> def setUpClass(cls):
>
> conf = SparkConf().setMaster("local[2]") \
>
> .setAppName(cls.__name__) #\
>
> # .set("spark.authenticate.secret", "11")
>
> cls.sparkContext = SparkContext(conf=conf)
>
> sc_values[cls.__name__] = cls.sparkContext
>
> cls.sqlContext = SQLContext(cls.sparkContext)
>
> print("aedwip:", SparkContext)
>
>
> @classmethod
>
> def tearDownClass(cls):
>
> print("calling stop tearDownClas, the content of sc_values=",
> sc_values)
>
> sc_values.clear()
>
> cls.sparkContext.stop()
>
>
> This looks similar to Class  PySparkTestCase in https://github.com/apache/
> spark/blob/master/python/pyspark/tests.py
>
>
> Any suggestions would be greatly appreciated.
>
>
> Andy
>
>
> My downloaded version is spark-2.3.0-bin-hadoop2.7
>
>
> My virtual env version is
>
> (spark-2.3.0) $ pip show pySpark
>
> Name: pyspark
>
> Version: 2.3.0
>
> Summary: Apache Spark Python API
>
> Home-page: https://github.com/apache/spark/tree/master/python
>
> Author: Spark Developers
>
> Author-email: d...@spark.apache.org
>
> License: http://www.apache.org/licenses/LICENSE-2.0
>
> Location: /Users/a/workSpace/pythonEnv/spark-2.3.0/lib/python3.6/
> site-packages
>
> Requires: py4j
>
> (spark-2.3.0) $
>
>
> (spark-2.3.0) $ python --version
>
> Python 3.6.1
>
> (spark-2.3.0) $
>
>
>


Re: [PySpark SQL] sql function to_date and to_timestamp return the same data type

2018-03-18 Thread Hyukjin Kwon
Mind if I ask for a reproducer? It seems to return timestamps fine:

>>> from pyspark.sql.functions import *
>>> spark.range(1).select(to_timestamp(current_timestamp())).printSchema()
root
 |-- to_timestamp(current_timestamp()): timestamp (nullable = false)

>>> spark.range(1).select(to_timestamp(current_timestamp())).show()
+-+
|to_timestamp(current_timestamp())|
+-+
| 2018-03-19 14:45:...|
+-+

>>> spark.range(1).select(current_timestamp().cast("timestamp")).printSchema()
root
 |-- CAST(current_timestamp() AS TIMESTAMP): timestamp (nullable = false)

>>> spark.range(1).select(current_timestamp().cast("timestamp")).show()
+--+
|CAST(current_timestamp() AS TIMESTAMP)|
+--+
|  2018-03-19 14:45:...|
+--+
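
For reference, a short sketch contrasting the two functions on Spark 2.2+ (the
input column name is illustrative): to_date yields DateType while to_timestamp
yields TimestampType.

>>> from pyspark.sql.functions import to_date, to_timestamp
>>> df = spark.createDataFrame([("2018-03-18 10:30:00",)], ["ts_string"])
>>> df.select(
...     to_date("ts_string").alias("as_date"),            # date
...     to_timestamp("ts_string").alias("as_timestamp"),  # timestamp
... ).printSchema()
root
 |-- as_date: date (nullable = true)
 |-- as_timestamp: timestamp (nullable = true)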



2018-03-16 9:00 GMT+09:00 Alan Featherston Lago :

> I'm a pretty new user of spark and I've run into this issue with the
> pyspark docs:
>
> The functions pyspark.sql.functions.to_date && 
> pyspark.sql.functions.to_timestamp
> behave in the same way. As in both functions convert a Column of
> pyspark.sql.types.StringType or pyspark.sql.types.TimestampType into
> pyspark.sql.types.DateType.
>
> Shouldn't the function `to_timestamp` return pyspark.sql.types.
> TimestampType?
> Also the to_timestamp docs say that "By default, it follows casting rules
> to pyspark.sql.types.TimestampType if the format is omitted (equivalent
> to col.cast("timestamp")). ", which doesn't seem to be right ie:
>
> to_timestamp(current_timestamp()) <> current_timestamp().cast("timestamp")
>
>
> This is wrong right? or am I missing something? (is this due to the
> underlying jvm data types?)
>
>
> Cheers,
> alan
>


Re: SparkR test script issue: unable to run run-tests.h on spark 2.2

2018-02-14 Thread Hyukjin Kwon
From a very quick look, I think it is a testthat version issue with SparkR.

I had to fix that version to 1.x before in AppVeyor. There are a few details
in https://github.com/apache/spark/pull/20003

Can you check and lower the testthat version?


On 14 Feb 2018 6:09 pm, "chandan prakash"  wrote:

> Hi All,
> I am trying to run the R test script under ./R/run-tests.sh but am hitting the
> same ERROR every time.
> I tried running on a mac as well as a centos machine; the same issue comes up.
> I am using spark 2.2 (branch-2.2)
> I followed from apache doc and followed the steps:
> 1. installed R
> 2. installed packages like testthat as mentioned in doc
> 3. run run-tests.h
>
>
> Every time I am getting this error line:
>
> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
>   object 'run_tests' not found
> Calls: ::: -> get
> Execution halted
>
>
> Any Help?
>
> --
> Chandan Prakash
>
>


Re: Custom line/record delimiter

2018-01-01 Thread Hyukjin Kwon
Hi,


There's a PR - https://github.com/apache/spark/pull/18581 and JIRA
- SPARK-21289

Alternatively, you could check out the multiLine option for CSV and see if it is
applicable.
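
For example, a minimal sketch of that option on the read path (the path and
options are illustrative assumptions):

df = (spark.read
      .option("header", True)
      .option("multiLine", True)  # records may span several physical lines
      .option("escape", '"')
      .csv("/tmp/data_with_embedded_newlines.csv"))
df.show(truncate=False)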


Thanks.


2017-12-30 2:19 GMT+09:00 sk skk :

> Hi,
>
> Do we have an option to write a csv or text file with a custom record/line
> separator through spark ?
>
> I could not find any reference in the API. I have an issue while loading data
> into a warehouse, as one of the columns in the CSV has a new line character and
> the warehouse does not let me escape that new line character.
>
> Thank you ,
> Sk
>


Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-17 Thread Hyukjin Kwon
When multiLine is not set, we currently only support ascii-compatible
encodings, to my knowledge, mainly due to the line separator, as I
investigated in the comment.
When multiLine is set, it appears encoding is not considered. I
actually meant in the comment that encoding does not work at all in this case,
but it looks like I should have been clearer on this.

I have been aware of it, but I personally think the encoding option has been
left rather incomplete due to non-ascii-compatible encodings, and this actually
brings complexity. For at least a year, I have been (personally)
wondering if we should keep extending this feature or if we should rather
deprecate this option.

The direction itself in your diff looks roughly correct and I can't deny
that's a valid issue and fix for the current status.

A workaround for now would be to make a custom Hadoop input format, read the
data as a text dataset, and parse it with DataFrameReader.csv(csvDataset:
Dataset[String]).
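
For reference, a rough PySpark sketch of the same idea on Spark 2.2+, where
spark.read.csv also accepts an RDD of strings (the file path and charset are
illustrative, and this simplified version assumes no newlines inside quoted
fields, which is what the custom input format would otherwise handle):

raw = spark.sparkContext.binaryFiles("/tmp/sjis_data.csv")  # (path, bytes) pairs
lines = raw.flatMap(lambda kv: kv[1].decode("shift_jis").splitlines())
df = spark.read.csv(lines, header=True)  # parse the already-decoded rows
df.show()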



2017-08-17 19:42 GMT+09:00 Han-Cheol Cho :

> Hi,
>
> Thank you for your response.
> I finally found the cause of this
>
>
> When multiLine option is set, input file is read by
> UnivocityParser.parseStream() method.
> This method, in turn, calls convertStream() that initializes tokenizer
> with tokenizer.beginParsing(inputStream) and parses records using
> tokenizer.parseNext().
>
> The problem is that beginParsing() method uses UTF-8 as its default
> char-encoding.
> As a result, user provided "encoding" option will be ignored.
>
>
> When multiLine option is NOT set, on the other hand, input file is first
> read and decoded from TextInputCSVDataSource.readFile() method.
> Then, it is sent to UnivocityParser.parseIterator() method.
> Therefore, no problem is occurred in in this case.
>
>
> To solve this problem, I removed the call for tokenizer.beginParsing()
> method in convertStream() since we cannot access options.charset variable
> here.
> Then, added it to two places: tokenizeStream() and parseStream() methods.
> Especially, in parseStream() method, I added charset as the second
> parameter for beginParsing() method.
>
> I attached git diff content as an attachment file.
> I appreciate any comments on this.
>
>
> Best wishes,
> Han-Cheol
>
>
>
>
> On Wed, Aug 16, 2017 at 3:09 PM, Takeshi Yamamuro 
> wrote:
>
>> Hi,
>>
>> Since the csv source currently supports ascii-compatible charset, so I
>> guess shift-jis also works well.
>> You could check Hyukjin's comment in https://issues.apache.org/j
>> ira/browse/SPARK-21289 for more info.
>>
>>
>> On Wed, Aug 16, 2017 at 2:54 PM, Han-Cheol Cho 
>> wrote:
>>
>>> My apologies,
>>>
>>> It was a problem of our Hadoop cluster.
>>> When we tested the same code on another cluster (HDP-based), it worked
>>> without any problem.
>>>
>>> ```scala
>>> ## make sjis text
>>> cat a.txt
>>> 8月データだけでやってみよう
>>> nkf -W -s a.txt >b.txt
>>> cat b.txt
>>> 87n%G!<%?$@$1$G$d$C$F$_$h$&
>>> nkf -s -w b.txt
>>> 8月データだけでやってみよう
>>> hdfs dfs -put a.txt b.txt
>>>
>>> ## YARN mode test
>>> spark.read.option("encoding", "utf-8").csv("a.txt").show(1)
>>> +--+
>>> |   _c0|
>>> +--+
>>> |8月データだけでやってみよう|
>>> +--+
>>>
>>> spark.read.option("encoding", "sjis").csv("b.txt").show(1)
>>> +--+
>>> |   _c0|
>>> +--+
>>> |8月データだけでやってみよう|
>>> +--+
>>>
>>> spark.read.option("encoding", "utf-8").option("multiLine",
>>> true).csv("a.txt").show(1)
>>> +--+
>>> |   _c0|
>>> +--+
>>> |8月データだけでやってみよう|
>>> +--+
>>>
>>> spark.read.option("encoding", "sjis").option("multiLine",
>>> true).csv("b.txt").show(1)
>>> +--+
>>> |   _c0|
>>> +--+
>>> |8月データだけでやってみよう|
>>> +--+
>>> ```
>>>
>>> I am still digging the root cause and will share it later :-)
>>>
>>> Best wishes,
>>> Han-Choel
>>>
>>>
>>> On Wed, Aug 16, 2017 at 1:32 PM, Han-Cheol Cho 
>>> wrote:
>>>
 Dear Spark ML members,


 I experienced trouble using the "multiLine" option to load CSV data
 with Shift-JIS encoding.
 When option("multiLine", true) is specified, option("encoding",
 "encoding-name") just doesn't work anymore.


 In CSVDataSource.scala file, I found that MultiLineCSVDataSource.readFile()
 method doesn't use parser.options.charset at all.

 object MultiLineCSVDataSource extends CSVDataSource {
   override val isSplitable: Boolean = false

   override def readFile(
   conf: Configuration,
   file: PartitionedFile,
   parser: UnivocityParser,
   schema: StructType): Iterator[InternalRow] = {
 UnivocityParser.parseStream(
   CodecStreams.createInputStreamWithCloseResource(conf,
 file.filePath),
   parser.options.headerFlag,
   parser,
   schema)
   }
   ...

 On the other hand, 

Re: how to set the assignee in JIRA please?

2017-07-25 Thread Hyukjin Kwon
That one is waiting for a review, as seen. There have been a few discussions about
this. I am personally against closing a PR only because it is old.

I have periodically made PRs to close other inactive PRs (e.g., ones not
responsive to review comments or Jenkins failures).

So, I guess most of such PRs are waiting for a review.


How about drawing attention to the PRs you see value in, with a proper
explanation and investigation, or trying to review them, rather than just
suggesting they be closed?

I think anyone in this community can review.



2017-07-26 11:58 GMT+09:00 萝卜丝炒饭 <1427357...@qq.com>:

> Hi all,
>
> I find some PRs were created one year ago, and the last comment is from several
> months before.
> No one has closed or rejected them.
> Such as 6880, just leave it like this?
>
>
> ---Original---
> *From:* "Hyukjin Kwon"<gurwls...@gmail.com>
> *Date:* 2017/7/25 09:25:28
> *To:* "萝卜丝炒饭"<1427357...@qq.com>;
> *Cc:* "user @spark"<user@spark.apache.org>;"Marcelo Vanzin"<
> van...@cloudera.com>;
> *Subject:* Re: how to set the assignee in JIRA please?
>
> I think that's described in the link I used - http://spark.apache.org/cont
> ributing.html.
>
> On 25 Jul 2017 10:22 am, "萝卜丝炒饭" <1427357...@qq.com> wrote:
>
> Another issue about contribution.
>
> After a pull request is created, what should creator do next please?
> Who will close it please?
>
> ---Original---
> *From:* "Hyukjin Kwon"<gurwls...@gmail.com>
> *Date:* 2017/7/25 09:15:49
> *To:* "Marcelo Vanzin"<van...@cloudera.com>;
> *Cc:* "user"<user@spark.apache.org>;"萝卜丝炒饭"<1427357...@qq.com>;
> *Subject:* Re: how to set the assignee in JIRA please?
>
> I see. In any event, it sounds not required to work on an issue -
> http://spark.apache.org/contributing.html .
>
> "... There is no need to be the Assignee of the JIRA to work on it, though
> you are welcome to comment that you have begun work.."
>
> and I was just wondering out of my curiosity. It should be not a big deal
> anyway.
>
>
> Thanks for the details.
>
>
>
> 2017-07-25 10:09 GMT+09:00 Marcelo Vanzin <van...@cloudera.com>:
>
>> On Mon, Jul 24, 2017 at 6:04 PM, Hyukjin Kwon <gurwls...@gmail.com>
>> wrote:
>> > However, I see some JIRAs are assigned to someone time to time. Were
>> those
>> > mistakes or would you mind if I ask when someone is assigned?
>>
>> I'm not sure if there are any guidelines of when to assign; since
>> there has been an agreement that bugs should remain unassigned I don't
>> think I've personally done it, although I have seen others do it. In
>> general I'd say it's ok if there's a good justification for it (e.g.
>> "this is a large change and this person who is an active contributor
>> will work on it"), but in the general case should be avoided.
>>
>> I agree it's a little confusing, especially comparing to other
>> projects, but it's how it's been done for a couple of years at least
>> (or at least what I have understood).
>>
>>
>> --
>> Marcelo
>>
>
>
>


Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
I think that's described in the link I used - http://spark.apache.org/
contributing.html.

On 25 Jul 2017 10:22 am, "萝卜丝炒饭" <1427357...@qq.com> wrote:

Another issue about contribution.

After a pull request is created, what should creator do next please?
Who will close it please?

---Original---
*From:* "Hyukjin Kwon"<gurwls...@gmail.com>
*Date:* 2017/7/25 09:15:49
*To:* "Marcelo Vanzin"<van...@cloudera.com>;
*Cc:* "user"<user@spark.apache.org>;"萝卜丝炒饭"<1427357...@qq.com>;
*Subject:* Re: how to set the assignee in JIRA please?

I see. In any event, it sounds not required to work on an issue -
http://spark.apache.org/contributing.html .

"... There is no need to be the Assignee of the JIRA to work on it, though
you are welcome to comment that you have begun work.."

and I was just wondering out of my curiosity. It should be not a big deal
anyway.


Thanks for the details.



2017-07-25 10:09 GMT+09:00 Marcelo Vanzin <van...@cloudera.com>:

> On Mon, Jul 24, 2017 at 6:04 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> > However, I see some JIRAs are assigned to someone time to time. Were
> those
> > mistakes or would you mind if I ask when someone is assigned?
>
> I'm not sure if there are any guidelines of when to assign; since
> there has been an agreement that bugs should remain unassigned I don't
> think I've personally done it, although I have seen others do it. In
> general I'd say it's ok if there's a good justification for it (e.g.
> "this is a large change and this person who is an active contributor
> will work on it"), but in the general case should be avoided.
>
> I agree it's a little confusing, especially comparing to other
> projects, but it's how it's been done for a couple of years at least
> (or at least what I have understood).
>
>
> --
> Marcelo
>


Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
I see. In any event, it sounds not required to work on an issue -
http://spark.apache.org/contributing.html .

"... There is no need to be the Assignee of the JIRA to work on it, though
you are welcome to comment that you have begun work.."

and I was just wondering out of my curiosity. It should be not a big deal
anyway.


Thanks for the details.



2017-07-25 10:09 GMT+09:00 Marcelo Vanzin <van...@cloudera.com>:

> On Mon, Jul 24, 2017 at 6:04 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> > However, I see some JIRAs are assigned to someone time to time. Were
> those
> > mistakes or would you mind if I ask when someone is assigned?
>
> I'm not sure if there are any guidelines of when to assign; since
> there has been an agreement that bugs should remain unassigned I don't
> think I've personally done it, although I have seen others do it. In
> general I'd say it's ok if there's a good justification for it (e.g.
> "this is a large change and this person who is an active contributor
> will work on it"), but in the general case should be avoided.
>
> I agree it's a little confusing, especially comparing to other
> projects, but it's how it's been done for a couple of years at least
> (or at least what I have understood).
>
>
> --
> Marcelo
>


Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
However, I see some JIRAs are assigned to someone from time to time. Were those
mistakes, or would you mind if I ask when someone gets assigned?

When I started to contribute to Spark a few years ago, I was confused by this,
and I am pretty sure some people are still confused.

I do usually say something like "it is generally not done that way" when
I am asked, but I find myself unable to explain further.



2017-07-25 9:59 GMT+09:00 Marcelo Vanzin :

> We don't generally set assignees. Submit a PR on github and the PR
> will be linked on JIRA; if your PR is submitted, then the bug is
> assigned to you.
>
> On Mon, Jul 24, 2017 at 5:57 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
> > Hi all,
> > If I want to do some work on an issue registered in JIRA, how do I set the
> > assignee to me, please?
> >
> > thanks
> >
> >
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: to_json not working with selectExpr

2017-07-19 Thread Hyukjin Kwon
Yes, I guess it is.

2017-07-20 11:31 GMT+09:00 Matthew cao <cybea...@gmail.com>:

> Ah, I get it. So that’s why I get the "not registered" error? Because it was not
> added to SQL in 2.1.0?
>
> On 2017年7月19日, at 22:35, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> Yea, but it was added into SQL from Spark 2.2.0
>
> 2017-07-19 23:02 GMT+09:00 Matthew cao <cybea...@gmail.com>:
>
> I am using version 2.1.1. As far as I can remember, this function was added
> in 2.1.0.
>>
>> On 2017年7月17日, at 12:05, Burak Yavuz <brk...@gmail.com> wrote:
>>
>> Hi Matthew,
>>
>> Which Spark version are you using? The expression `to_json` was added in
>> 2.2 with this commit:
>> https://github.com/apache/spark/commit/0cdcf9114527a2c359c25
>> e46fd6556b3855bfb28
>>
>> Best,
>> Burak
>>
>> On Sun, Jul 16, 2017 at 6:24 PM, Matthew cao <cybea...@gmail.com> wrote:
>>
>>> Hi all,
>>> I just read the databricks blog here: https://docs.databricks.
>>> com/_static/notebooks/complex-nested-structured.html
>>>
>>> When I try to follow the example about the to_json and selectExpr part,
>>> it gave error: “org.apache.spark.sql.AnalysisException: Undefined
>>> function: 'to_json'. This function is neither a registered temporary
>>> function nor a permanent function registered in the database 'default'.;
>>> line 1 pos 0”.
>>> Also this error show in the original databricks notebook. I know that
>>> to_json function works great with select. Do I miss something when using
>>> selectExpr? THX.
>>>
>>> Best,
>>> Matthew
>>>
>>>
>>
>>
>
>


Re: to_json not working with selectExpr

2017-07-19 Thread Hyukjin Kwon
Yea, but it was added into SQL from Spark 2.2.0
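
For reference, a short sketch of the distinction (the column names are
illustrative): the function-API form of to_json is available from Spark 2.1,
while the SQL form used by selectExpr is only registered from Spark 2.2.0.

from pyspark.sql.functions import to_json, struct

df = spark.range(1).selectExpr("id", "id * 2 AS doubled")

# Spark 2.1+: DataFrame function API
df.select(to_json(struct("id", "doubled")).alias("json")).show(truncate=False)

# Spark 2.2+ only: to_json as a SQL function inside selectExpr
df.selectExpr("to_json(struct(id, doubled)) AS json").show(truncate=False)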

2017-07-19 23:02 GMT+09:00 Matthew cao :

> I am using version 2.1.1. As far as I can remember, this function was added
> in 2.1.0.
>
> On 2017年7月17日, at 12:05, Burak Yavuz  wrote:
>
> Hi Matthew,
>
> Which Spark version are you using? The expression `to_json` was added in
> 2.2 with this commit:
> https://github.com/apache/spark/commit/0cdcf9114527a2c359c25e46fd6556
> b3855bfb28
>
> Best,
> Burak
>
> On Sun, Jul 16, 2017 at 6:24 PM, Matthew cao  wrote:
>
>> Hi all,
>> I just read the databricks blog here: https://docs.databricks.
>> com/_static/notebooks/complex-nested-structured.html
>>
>> When I try to follow the example about the to_json and selectExpr part,
>> it gave error: “org.apache.spark.sql.AnalysisException: Undefined
>> function: 'to_json'. This function is neither a registered temporary
>> function nor a permanent function registered in the database 'default'.;
>> line 1 pos 0”.
>> Also this error show in the original databricks notebook. I know that
>> to_json function works great with select. Do I miss something when using
>> selectExpr? THX.
>>
>> Best,
>> Matthew
>>
>>
>
>


Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Hyukjin Kwon
Cool!

2017-07-13 9:43 GMT+09:00 Denny Lee :

> This is amazingly awesome! :)
>
> On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com 
> wrote:
>
>> That's great!
>>
>>
>>
>> On 12 July 2017 at 12:41, Felix Cheung  wrote:
>>
>>> Awesome! Congrats!!
>>>
>>> --
>>> *From:* holden.ka...@gmail.com  on behalf of
>>> Holden Karau 
>>> *Sent:* Wednesday, July 12, 2017 12:26:00 PM
>>> *To:* user@spark.apache.org
>>> *Subject:* With 2.2.0 PySpark is now available for pip install from
>>> PyPI :)
>>>
>>> Hi wonderful Python + Spark folks,
>>>
>>> I'm excited to announce that with Spark 2.2.0 we finally have PySpark
>>> published on PyPI (see https://pypi.python.org/pypi/pyspark /
>>> https://twitter.com/holdenkarau/status/885207416173756417). This has
>>> been a long time coming (previous releases included pip installable
>>> artifacts that for a variety of reasons couldn't be published to PyPI). So
>>> if you (or your friends) want to be able to work with PySpark locally on
>>> your laptop you've got an easier path getting started (pip install pyspark).
>>>
>>> If you are setting up a standalone cluster your cluster will still need
>>> the "full" Spark packaging, but the pip installed PySpark should be able to
>>> work with YARN or an existing standalone cluster installation (of the same
>>> version).
>>>
>>> Happy Sparking y'all!
>>>
>>> Holden :)
>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>


Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread Hyukjin Kwon
Sounds like it is related to https://github.com/apache/spark/pull/17916

We will allow picking up the internal one if that one gets merged.

On 10 May 2017 7:09 am, "Mark Hamstra"  wrote:

> Looks to me like it is a conflict between a Databricks library and Spark
> 2.1. That's an issue for Databricks to resolve or provide guidance.
>
> On Tue, May 9, 2017 at 2:36 PM, lucas.g...@gmail.com wrote:
>
>> I'm a bit confused by that answer; I'm assuming it's Spark deciding which
>> lib to use.
>>
>> On 9 May 2017 at 14:30, Mark Hamstra  wrote:
>>
>>> This looks more like a matter for Databricks support than spark-user.
>>>
>>> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com <
>>> lucas.g...@gmail.com> wrote:
>>>
 df = spark.sqlContext.read.csv('out/df_in.csv')
>


> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so
> recording the schema version 1.2.0
> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database
> global_temp, returning NoSuchObjectException
>


> Py4JJavaError: An error occurred while calling o72.csv.
> : java.lang.RuntimeException: Multiple sources found for csv 
> (*com.databricks.spark.csv.DefaultSource15,
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat*),
> please specify the fully qualified class name.
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.execution.datasources.DataSource$.looku
> pDataSource(DataSource.scala:591)
> at org.apache.spark.sql.execution.datasources.DataSource.provid
> ingClass$lzycompute(DataSource.scala:86)
> at org.apache.spark.sql.execution.datasources.DataSource.provid
> ingClass(DataSource.scala:86)
> at org.apache.spark.sql.execution.datasources.DataSource.resolv
> eRelation(DataSource.scala:325)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.sc
> ala:152)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
> ssorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
> thodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.j
> ava:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214) at
> java.lang.Thread.run(Thread.java:745)


 When I change our call to:

 df = spark.hiveContext.read \
 .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')
 \
 .load('df_in.csv)

 No such issue, I was under the impression (obviously wrongly) that
 spark would automatically pick the local lib.  We have the databricks
 library because other jobs still explicitly call it.

 Is the 'correct answer' to go through and modify so as to remove the
 databricks lib / remove it from our deploy?  Or should this just work?

 One of the things I find less helpful in the spark docs are when
 there's multiple ways to do it but no clear guidance on what those methods
 are intended to accomplish.

 Thanks!

>>>
>>>
>>
>


Re: Why selectExpr changes schema (to include id column)?

2017-03-27 Thread Hyukjin Kwon
Thanks for your confirmation.

On 28 Mar 2017 5:02 a.m., "Jacek Laskowski" <ja...@japila.pl> wrote:

Hi Hyukjin,

It was a false alarm as I had a local change to `def schema` in
`Dataset` that caused the issue.

I apologize for the noise. Sorry and thanks a lot for the prompt
response. I appreciate.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Mon, Mar 27, 2017 at 2:43 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> I just tried to build against the current master to help check -
> https://github.com/apache/spark/commit/3fbf0a5f9297f438bc92db11f106d4
a0ae568613
>
> It seems I can't reproduce this as below:
>
>
> scala> spark.range(1).printSchema
> root
>  |-- id: long (nullable = false)
>
>
> scala> spark.range(1).selectExpr("*").printSchema
> root
>  |-- id: long (nullable = false)
>
>
> scala> spark.version
> res2: String = 2.2.0-SNAPSHOT
>
>
>
>
> 2017-03-27 17:58 GMT+09:00 Jacek Laskowski <ja...@japila.pl>:
>>
>> Hi,
>>
>> While toying with selectExpr I've noticed that the schema changes to
>> include id column. I can't seem to explain it. Anyone?
>>
>> scala> spark.range(1).printSchema
>> root
>>  |-- value: long (nullable = true)
>>
>> scala> spark.range(1).selectExpr("*").printSchema
>> root
>>  |-- id: long (nullable = false)
>>
>> p.s. http://stackoverflow.com/q/43041975/1305344
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>


Re: Why selectExpr changes schema (to include id column)?

2017-03-27 Thread Hyukjin Kwon
I just tried to build against the current master to help check -
https://github.com/apache/spark/commit/3fbf0a5f9297f438bc92db11f106d4a0ae568613

It seems I can't reproduce this as below:


scala> spark.range(1).printSchema
root
 |-- id: long (nullable = false)


scala> spark.range(1).selectExpr("*").printSchema
root
 |-- id: long (nullable = false)


scala> spark.version
res2: String = 2.2.0-SNAPSHOT




2017-03-27 17:58 GMT+09:00 Jacek Laskowski :

> Hi,
>
> While toying with selectExpr I've noticed that the schema changes to
> include id column. I can't seem to explain it. Anyone?
>
> scala> spark.range(1).printSchema
> root
>  |-- value: long (nullable = true)
>
> scala> spark.range(1).selectExpr("*").printSchema
> root
>  |-- id: long (nullable = false)
>
> p.s. http://stackoverflow.com/q/43041975/1305344
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: CSV empty columns handling in Spark 2.0.2

2017-03-16 Thread Hyukjin Kwon
I think this is fixed in https://github.com/apache/spark/pull/15767

This should be fixed in 2.1.0.
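
Until you can upgrade, one possible workaround (a sketch only, not verified against your file) is to skip schema inference so every column is read as a string, and cast the affected columns afterwards; casting an empty string yields null. The column name below is hypothetical:

// Read all columns as strings (no inferSchema), then cast explicitly.
val raw = sqlContext.read.option("header", "true").csv("file location")
val typed = raw.withColumn("count", raw("count").cast("int"))
typed.show()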


2017-03-17 3:28 GMT+09:00 George Obama :

> Hello,
>
>
>
> I am using spark 2.0.2 to read the CSV file with empty columns and is
> hitting the issue:
>
> scala>val df = sqlContext.read.option("header", true).option("inferSchema", 
> true).csv("file location")
>
> 17/03/13 07:26:26 WARN DataSource: Error while looking for metadata directory.
>
>
> scala> df.show()
>
> 17/03/13 07:26:41 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 4)
> java.lang.NumberFormatException: null
>
> at java.lang.Integer.parseInt(Integer.java:542)
> at java.lang.Integer.parseInt(Integer.java:615)
> at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
> at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:241)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>
>
> Could you help me with this issue please?
>
>
> Besides, what does the warning mean?
>
> 17/03/13 07:26:26 WARN DataSource: Error while looking for metadata directory.
>
> Regards, John
>
>
>


Re: [Spark CSV]: Use Custom TextInputFormat to Prevent Exceptions

2017-03-15 Thread Hyukjin Kwon
Other options are maybe :

- "spark.sql.files.ignoreCorruptFiles" option

- DataFrameReader.csv(csvDataset: Dataset[String]) with custom inputformat
(this is available from Spark 2.2.0).

For example,

import spark.implicits._  // needed for .toDS outside the spark-shell

// Substitute a custom InputFormat that tolerates corrupt input for TextInputFormat below.
val rdd = spark.sparkContext.newAPIHadoopFile("/tmp/abcd",
  classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
  classOf[org.apache.hadoop.io.LongWritable],
  classOf[org.apache.hadoop.io.Text])
val stringRdd = rdd.map(pair => new String(pair._2.getBytes, 0, pair._2.getLength))

spark.read.csv(stringRdd.toDS)

​


2017-03-16 2:11 GMT+09:00 Jörn Franke :

> Hi,
>
> The Spark CSV parser has different parsing modes:
> * permissive (default) tries to read everything and missing tokens are
> interpreted as null and extra tokens are ignored
> * dropmalformed drops lines which have more or less tokens
> * failfast - runtimexception if there is a malformed line
> Obvious this does not handle malformed gzip (you may ask the sender of the
> gzip to improve their application).
>
> You can adapt the line you mentioned (not sure which Spark version this
> is), but you may not want to do it, because this would mean to maintain an
> own Spark version.
>
> You can write your own datasource (i.e. different namespace than Spark
> CSV) Then, you can also think about a lot of optimisations compared to the
> Spark csv parser, which - depending on the csv and your analysis needs -
> can make processing much more faster.
>
> You could also add a new compressioncodec that ignores broken gzips. In
> this case you would not need an own data source.
>
> Best regards
>
> On 15 Mar 2017, at 16:56, Nathan Case  wrote:
>
> Accidentally sent this to the dev mailing list, meant to send it here.
>
> I have a spark java application that in the past has used the hadoopFile
> interface to specify a custom TextInputFormat to be used when reading
> files.  This custom class would gracefully handle exceptions like EOF
> exceptions caused by corrupt gzip files in the input data.  I recently
> switched to using the csv parser built into spark but am now faced with the
> problem that anytime a bad input file is encountered my whole job fails.
>
> My code to load the data using csv is:
>
> Dataset csv = sparkSession.read()
> .option("delimiter", parseSettings.getDelimiter().toString())
> .option("quote", parseSettings.getQuote())
> .option("parserLib", "UNIVOCITY")
> .csv(paths);
>
> Previously I would load the data using:
>
> JavaRDD lines = sc.newAPIHadoopFile(filePaths, NewInputFormat.class,
> LongWritable.class, Text.class, sc.hadoopConfiguration())
> .values()
> .map(Text::toString);
>
>
> Looking at the CSVFileFormat.scala class it looks like in the private
> readText method if I would overwrite where it passes TextInputFormat to the
> hadoopFile method with my customer format I would be able to achieve what I
> want.
>
> private def readText(
> sparkSession: SparkSession,
> options: CSVOptions,
> location: String): RDD[String] = {
>   if (Charset.forName(options.charset) == StandardCharsets.UTF_8) {
> sparkSession.sparkContext.textFile(location)
>   } else {
> val charset = options.charset
> sparkSession.sparkContext
>
>// This is where I would want to be able to specify my
>
>// input format instead of TextInputFormat
>
>   .hadoopFile[LongWritable, Text, TextInputFormat](location)
>   .mapPartitions(_.map(pair => new String(pair._2.getBytes, 0, 
> pair._2.getLength, charset)))
>   }
> }
>
>
> Does anyone know if there is another way to prevent the corrupt files from
> failing my job or could help to make the required changes to make the
> TextInputFormat customizable as I have only just started looking at scala.
>
> Thanks,
> Nathan
>
>


Re: DataFrameWriter - Where to find list of Options applicable to particular format(datasource)

2017-03-13 Thread Hyukjin Kwon
Hi, all the options are documented in
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter

It seems neither of those options is available for writing. If the goal is
trimming the whitespace, I think we could do this with dataframe operations (as
we talked about in the JIRA - https://issues.apache.org/jira/browse/SPARK-18579).
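
For example, a rough sketch of that approach (the column name below is hypothetical):

import org.apache.spark.sql.functions.{col, trim}

// Trim leading/trailing whitespace in a string column before writing.
val cleaned = df.withColumn("name", trim(col("name")))
cleaned.write.option("header", "true").csv("/tmp/cleaned_csv")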



2017-03-14 9:20 GMT+09:00 Nirav Patel :

> Hi,
>
> Is there a document for each datasource (csv, tsv, parquet, json, avro)
> with available options ?  I need to find one for csv to
> "ignoreLeadingWhiteSpace" and "ignoreTrailingWhiteSpace"
>
> Thanks
>
>
>
> 


Re: using spark-xml_2.10 to extract data from XML file

2017-02-14 Thread Hyukjin Kwon
Hi Carlo,


There was a bug in older versions of the library when accessing nested
values.

Otherwise, I suspect an issue with parsing malformed XML.

Could you maybe open an issue at
https://github.com/databricks/spark-xml/issues with your sample data?

I will stay on it until it is solved.


Thanks.



2017-02-15 5:04 GMT+09:00 Carlo.Allocca :

> more specifically:
>
> Given the following XML data structure:
>
> This is the Structure of the XML file:
>
> xocs:doc
>  |-- xocs:item: struct (nullable = true)
>  ||-- bibrecord: struct (nullable = true)
>  |||-- head: struct (nullable = true)
>  ||||-- abstracts: struct (nullable = true)
>  |||||-- abstract: struct (nullable = true)
>  ||||||-- _original: string (nullable = true)
>  ||||||-- _lang: string (nullable = true)
>  ||||||-- ce:para: string (nullable = true)
>
>
>
> CASE 1:
>
> String rowTag="abstracts”;
> Dataset df = (new XmlReader()).withAttributePrefix("_").
> withRowTag(rowTag).xmlFile(sqlContext, localxml);
> df.select(df.col("abstract.ce:para"), 
> df.col("abstract._original"),df.col("abstract._lang")
> ).show();
>
> *I got the right values. *
>
> CASE 2:
>
> String rowTag="xocs:doc";
> Dataset df = (new XmlReader()).withAttributePrefix("_").
> withRowTag(rowTag).xmlFile(sqlContext, localxml);
> df.select(df.col("xocs:item.item.bibrecord.head.abstracts.
> abstract.ce:para")).show();
>
> *I got null values.*
>
>
> My question is: How Can I get it right to use String rowTag="xocs:doc”;
> and get the right values for  ….abstract.ce:para, etc? what am I doing
> wrong?
>
> Many Thanks in advance.
> Best Regards,
> Carlo
>
>
>
> On 14 Feb 2017, at 17:35, carlo allocca  wrote:
>
> Dear All,
>
> I would like to ask you help about the following issue when using
> spark-xml_2.10:
>
> Given a XML file with the following structure:
>
> xocs:doc
>  |-- xocs:item: struct (nullable = true)
>  ||-- bibrecord: struct (nullable = true)
>  |||-- head: struct (nullable = true)
>  ||||-- abstracts: struct (nullable = true)
>  |||||-- abstract: struct (nullable = true)
>  ||||||-- _original: string (nullable = true)
>  ||||||-- _lang: string (nullable = true)
>  ||||||-- ce:para: string (nullable = true)
>
> Using the below code to extract all the info from the abstract:
>
> 1) I got “null" for each three values: _original, _lang and ce:para when I
> use rowTag = “xocs:doc”.
> 2) I got the right values when I use rowTag = “abstracts”.
>
> Of course, I need to write a general parser that works at xocs:doc level.
> I have been reading the documentation at https://github.com/
> databricks/spark-xml but I did not help me much to solve the above issue.
>
> Am I doing something wrong? or it may be related to bug the library I am
> using?
>
> Please, could you advice?
>
> Many Thanks,
> Best Regards,
> carlo
>
>
>
>
>
> === Code:
> public static void main(String arr[]) {
>
> // xocs:item/item/bibrecord/head/abstracts  section
> StructType _abstract = new StructType(new StructField[]{
> new StructField("_original", DataTypes.StringType, true,
> Metadata.empty()),
> new StructField("_lang", DataTypes.StringType, true,
> Metadata.empty()),
> new StructField("ce:para", DataTypes.StringType, true,
> Metadata.empty())
> });
> StructType _abstracts = new StructType(new StructField[]{
> new StructField("abstract", _abstract, true, Metadata.empty())
> });
>
> StructType _head = new StructType(new StructField[]{
> new StructField("abstracts", _abstracts, true,
> Metadata.empty())
> });
>
> StructType _bibrecord = new StructType(new StructField[]{
> new StructField("head", _head, true, Metadata.empty())
>
> });
>
> StructType _item = new StructType(new StructField[]{
> new StructField("bibrecord", _bibrecord, true,
> Metadata.empty())
> });
>
> StructType _xocs_item = new StructType(new StructField[]{
> new StructField("item", _item, true, Metadata.empty()),});
>
> StructType rexploreXMLDataSchema = new StructType(new
> StructField[]{
> new StructField("xocs:item", _xocs_item, true,
> Metadata.empty()),});
>
> String localxml = “..filename.xml";
>
> SparkSession spark = SparkSession
> .builder()
> .master("local[2]")
> .appName("DatasetForCaseNew")
> .getOrCreate();
>
> String rowTag = "xocs:doc";
>
>
>
> SQLContext sqlContext = new SQLContext(spark);
> Dataset df = sqlContext.read()
> .format("com.databricks.spark.xml")
> .option("rowTag", rowTag)
> 

Re: JavaRDD text matadata(file name) findings

2017-01-31 Thread Hyukjin Kwon
Hi,

Would it be possible to switch to the text datasource and use the
input_file_name function?
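
For example, a minimal sketch (the folder path is illustrative):

import org.apache.spark.sql.functions.input_file_name

// Each row keeps the fully qualified path of the file it came from.
val lines = spark.read.text("/path/to/folder")
  .withColumn("source_file", input_file_name())
lines.show(false)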

Thanks.

On 1 Feb 2017 3:58 a.m., "Manohar753" 
wrote:

Hi All,
My Spark job is reading data from a folder containing different files with the
same structure. The JavaRDD that is read is processed line by line, but is
there any way to know which file each line of data came from?
Thank you in advance for your reply.

Thanks,



--
View this message in context: http://apache-spark-user-list.
1001560.n3.nabble.com/JavaRDD-text-matadata-file-name-findings-tp28353.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Scala Developers

2017-01-25 Thread Hyukjin Kwon
Just as a subscriber to this mailing list, I don't want to receive job
recruiting emails, or to have to make the effort to set up a filter for them.

I don't know the policy in detail, but it feels inappropriate to send them to a
list where, in my experience, Spark users usually ask questions and discuss
Spark itself.

Please let me know if it is legitimate. If so, I will stop complaining and set
up a filter on my side.


On 25 Jan 2017 5:25 p.m., "marcos rebelo"  wrote:

> Hy all,
>
> I’m looking for Scala Developers willing to work on Berlin. We are working
> with Spark, AWS (the latest product are being prototyped StepFunctions,
> Batch Service, and old services Lambda Function, DynamoDB, ...) and
> building Data Products (JSON REST endpoints).
>
> We are responsible to chose the perfect article to the client, collect all
> the data on the site, …  We are responsible to build Recommenders, Site
> Search, … The technology is always changing, so you can learn on the job.
>
> Do you see yourself working as a Data Engineer? Contact me, I promise a
> great team.
>
> Best Regards
>
> Marcos Rebelo
>


Re: filter rows by all columns

2017-01-16 Thread Hyukjin Kwon
Hi Shawn,

Could we do this as below?

 for any of true

scala> val df = spark.range(10).selectExpr("id as a", "id / 2 as b")
df: org.apache.spark.sql.DataFrame = [a: bigint, b: double]

scala> df.filter(_.toSeq.exists(v => v == 1)).show()
+---+---+
|  a|  b|
+---+---+
|  1|0.5|
|  2|1.0|
+---+---+

​

or for all of true

scala> val df = spark.range(10).selectExpr("id as a", "id / 2 as b")
df: org.apache.spark.sql.DataFrame = [a: bigint, b: double]

scala> df.filter(_.toSeq.forall(v => v == 0)).show()
+---+---+
|  a|  b|
+---+---+
|  0|0.0|
+---+---+
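
Applied back to the original outlier question, a rough sketch that loops over all columns generically (assuming means and stddevs are Seq[Double] indexed in the same order as df.columns, as in your snippet):

// Keep rows where every column is within 3 standard deviations of its mean.
val filtered = df.filter { row =>
  row.toSeq.zipWithIndex.forall { case (v, i) =>
    math.abs(v.toString.toDouble - means(i)) <= 3 * stddevs(i)
  }
}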

​





2017-01-17 7:27 GMT+09:00 Shawn Wan :

> I need to filter out outliers from a dataframe by all columns. I can
> manually list all columns like:
>
> df.filter(x=>math.abs(x.get(0).toString().toDouble-means(0))<=3*stddevs(0
> ))
>
> .filter(x=>math.abs(x.get(1).toString().toDouble-means(1))<=3*stddevs(
> 1))
>
> ...
>
> But I want to turn it into a general function which can handle variable
> number of columns. How could I do that? Thanks in advance!
>
>
> Regards,
>
> Shawn
>
> --
> View this message in context: filter rows by all columns
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: Unable to explain the job kicked off for spark.read.csv

2017-01-08 Thread Hyukjin Kwon
Oh, I mean another job would *not* happen if the schema is explicitly given.
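
For example (a sketch; the column names below are guesses, not taken from the dataset):

import org.apache.spark.sql.types._

// With an explicit schema there is nothing to infer, so no up-front file scan is needed.
val schema = StructType(Seq(
  StructField("timestamp", StringType),
  StructField("site", StringType),
  StructField("requests", LongType)))
val baseDF = spark.read.schema(schema)
  .csv("s3://l4b-d4t4/wikipedia/pageviews-by-second-tsv")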

2017-01-09 16:37 GMT+09:00 Hyukjin Kwon <gurwls...@gmail.com>:

> Hi Appu,
>
>
> I believe that textFile and filter came from...
>
> https://github.com/apache/spark/blob/branch-2.1/sql/
> core/src/main/scala/org/apache/spark/sql/execution/
> datasources/csv/CSVFileFormat.scala#L59-L61
>
>
> It needs to read the first line even if the header and schema inference are
> disabled, because we still need a default string schema with the same number
> of fields as the first row; the fields are named "_c#", where # is the field
> position, when the schema is not specified manually.
>
> I believe that another job would happen if the schema is explicitly given
>
>
> I hope this is helpful
>
>
> Thanks.
>
> 2017-01-09 0:11 GMT+09:00 Appu K <kut...@gmail.com>:
>
>> I was trying to create a base-data-frame in an EMR cluster from a csv
>> file using
>>
>> val baseDF = spark.read.csv("s3://l4b-d4t4/wikipedia/pageviews-by-second-
>> tsv”)
>>
>> Omitted the options to infer the schema and specify the header, just to
>> understand what happens behind the screen.
>>
>>
>> The Spark UI shows that this kicked off a job with one stage.The stage
>> shows that a filter was applied
>>
>> Got curious a little bit about this. Is there any place where i could
>> better understand why a filter was applied here and why there was an action
>> in this case
>>
>>
>> thanks
>>
>>
>


Re: Unable to explain the job kicked off for spark.read.csv

2017-01-08 Thread Hyukjin Kwon
Hi Appu,


I believe that textFile and filter came from...

https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L59-L61


It needs to read the first line even if the header and schema inference are
disabled, because we still need a default string schema with the same number of
fields as the first row; the fields are named "_c#", where # is the field
position, when the schema is not specified manually.

I believe that another job would happen if the schema is explicitly given


I hope this is helpful


Thanks.

2017-01-09 0:11 GMT+09:00 Appu K :

> I was trying to create a base-data-frame in an EMR cluster from a csv file
> using
>
> val baseDF = spark.read.csv("s3://l4b-d4t4/wikipedia/pageviews-by-second-
> tsv”)
>
> Omitted the options to infer the schema and specify the header, just to
> understand what happens behind the screen.
>
>
> The Spark UI shows that this kicked off a job with one stage.The stage
> shows that a filter was applied
>
> Got curious a little bit about this. Is there any place where i could
> better understand why a filter was applied here and why there was an action
> in this case
>
>
> thanks
>
>


Re: Merging Parquet Files

2016-12-22 Thread Hyukjin Kwon
Hi Benjamin,


As you might already know, I believe the Hadoop command does not properly
merge column-based formats such as ORC or Parquet; it just simply
concatenates them.

I haven't tried this by myself but I remember I saw a JIRA in Parquet -
https://issues.apache.org/jira/browse/PARQUET-460

It seems parquet-tools allows merge small Parquet files into one.


Also, I believe there are command-line tools in Kite -
https://github.com/kite-sdk/kite

This might be useful.


Thanks!

2016-12-23 7:01 GMT+09:00 Benjamin Kim :

> Has anyone tried to merge *.gz.parquet files before? I'm trying to merge
> them into 1 file after they are output from Spark. Doing a coalesce(1) on
> the Spark cluster will not work. It just does not have the resources to do
> it. I'm trying to do it using the commandline and not use Spark. I will use
> this command in shell script. I tried "hdfs dfs -getmerge", but the file
> becomes unreadable by Spark with gzip footer error.
>
> Thanks,
> Ben
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: get corrupted rows using columnNameOfCorruptRecord

2016-12-07 Thread Hyukjin Kwon
Let me please just extend the suggestion a bit more verbosely.

I think you could try something like this maybe.

val jsonDF = spark.read
  .option("columnNameOfCorruptRecord", "xxx")
  .option("mode","PERMISSIVE")
  .schema(StructType(schema.fields :+ StructField("xxx", StringType, true)))
  .json(corruptRecords)
val malformed = jsonDF.filter("xxx is not null").select("xxx")
malformed.show()

This prints something like the ones below:

++
| xxx|
++
|   {|
|{"a":1, b:2}|
|{"a":{, b:3}|
|   ]|
++

​

If the schema is not specified, then the inferred schema has the malformed
column automatically

but in case of specifying the schema, I believe this should be manually set.




2016-12-07 18:06 GMT+09:00 Yehuda Finkelstein :

> Hi
>
>
>
> I tried it already but it say that this column doesn’t exists.
>
>
>
> scala> var df = spark.sqlContext.read.
>
>  | option("columnNameOfCorruptRecord","xxx").
>
>  | option("mode","PERMISSIVE").
>
>  | schema(df_schema.schema).json(f)
>
> df: org.apache.spark.sql.DataFrame = [auctionid: string, timestamp: string
> ... 37 more fields]
>
>
>
> scala> df.select
>
> select   selectExpr
>
>
>
> scala> df.select("xxx").show
>
> org.apache.spark.sql.AnalysisException: cannot resolve '`xxx`' given
> input columns: […];;
>
>
>
>   at org.apache.spark.sql.catalyst.analysis.package$
> AnalysisErrorAt.failAnalysis(package.scala:42)
>
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$
> anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(
> CheckAnalysis.scala:77)
>
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$
> anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(
> CheckAnalysis.scala:74)
>
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$
> transformUp$1.apply(TreeNode.scala:308)
>
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$
> transformUp$1.apply(TreeNode.scala:308)
>
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.
> withOrigin(TreeNode.scala:69)
>
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(
> TreeNode.scala:307)
>
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.
> transformExpressionUp$1(QueryPlan.scala:269)
>
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$
> spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$
> 2(QueryPlan.scala:279)
>
>   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$
> apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(
> QueryPlan.scala:283)
>
>   at scala.collection.TraversableLike$$anonfun$map$
> 1.apply(TraversableLike.scala:234)
>
>   at scala.collection.TraversableLike$$anonfun$map$
> 1.apply(TraversableLike.scala:234)
>
>   at scala.collection.mutable.ResizableArray$class.foreach(
> ResizableArray.scala:59)
>
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$
> spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$
> 2(QueryPlan.scala:283)
>
>   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$8.
> apply(QueryPlan.scala:288)
>
>   at org.apache.spark.sql.catalyst.trees.TreeNode.
> mapProductIterator(TreeNode.scala:186)
>
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(
> QueryPlan.scala:288)
>
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$
> anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$
> anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(
> TreeNode.scala:126)
>
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.
> checkAnalysis(CheckAnalysis.scala:67)
>
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.
> checkAnalysis(Analyzer.scala:58)
>
>   at org.apache.spark.sql.execution.QueryExecution.
> assertAnalyzed(QueryExecution.scala:49)
>
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$
> withPlan(Dataset.scala:2603)
>
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
>
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:987)
>
>   ... 48 elided
>
>
>
> scala>
>
>
>
>
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* Tuesday, December 06, 2016 10:26 PM
> *To:* Yehuda Finkelstein
> *Cc:* user
> *Subject:* Re: get corrupted rows using columnNameOfCorruptRecord
>
>
>
> .where("xxx IS NOT NULL") will give you the rows that couldn't be parsed.
>
>
>
> On Tue, Dec 6, 2016 at 6:31 AM, Yehuda Finkelstein <
> yeh...@veracity-group.com> wrote:
>
> Hi all
>
>
>
> I’m trying to parse json using existing schema and got rows with NULL’s
>
> //get schema
>
> 

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-12-05 Thread Hyukjin Kwon
Hi Kant,

Ah, I thought you wanted to find a workaround to do it.

Then wouldn't the workaround reach the same goal without such a new API?

Thanks.




On 6 Dec 2016 4:11 a.m., "kant kodali" <kanth...@gmail.com> wrote:

> Hi Kwon,
>
> Thanks for this but Isn't this what Michael suggested?
>
> Thanks,
> kant
>
> On Mon, Dec 5, 2016 at 4:45 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Hi Kant,
>>
>> How about doing something like this?
>>
>> import org.apache.spark.sql.functions._
>>
>> // val df2 = df.select(df("body").cast(StringType).as("body"))
>> val df2 = Seq("""{"a": 1}""").toDF("body")
>> val schema = spark.read.json(df2.as[String].rdd).schema
>> df2.select(from_json(col("body"), schema)).show()
>>
>> ​
>>
>> 2016-12-05 19:51 GMT+09:00 kant kodali <kanth...@gmail.com>:
>>
>>> Hi Michael,
>>>
>>> " Personally, I usually take a small sample of data and use schema
>>> inference on that.  I then hardcode that schema into my program.  This
>>> makes your spark jobs much faster and removes the possibility of the schema
>>> changing underneath the covers."
>>>
>>> This may or may not work for us. Not all rows have the same schema. The
>>> number of distinct schemas we have now may be small but going forward this
>>> can go to any number moreover a distinct call can lead to a table scan
>>> which can be billions of rows for us.
>>>
>>> I also would agree to keep the API consistent than making an exception
>>> however I wonder if it make sense to provide an action call to infer the
>>> schema which would return a new dataframe after the action call finishes
>>> (after schema inference)? For example, something like below ?
>>>
>>> val inferedDF = df.inferSchema(col1);
>>>
>>> Thanks,
>>>
>>>
>>>
>>>
>>> On Mon, Nov 28, 2016 at 6:12 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> You could open up a JIRA to add a version of from_json that supports
>>>> schema inference, but unfortunately that would not be super easy to
>>>> implement.  In particular, it would introduce a weird case where only this
>>>> specific function would block for a long time while we infer the schema
>>>> (instead of waiting for an action).  This blocking would be kind of odd for
>>>> a call like df.select(...).  If there is enough interest, though, we
>>>> should still do it.
>>>>
>>>> To give a little more detail, your version of the code is actually
>>>> doing two passes over the data: one to infer the schema and a second for
>>>> whatever processing you are asking it to do.  We have to know the schema at
>>>> each step of DataFrame construction, so we'd have to do this even before
>>>> you called an action.
>>>>
>>>> Personally, I usually take a small sample of data and use schema
>>>> inference on that.  I then hardcode that schema into my program.  This
>>>> makes your spark jobs much faster and removes the possibility of the schema
>>>> changing underneath the covers.
>>>>
>>>> Here's some code I use to build the static schema code automatically
>>>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1128172975083446/2840265927289860/latest.html>
>>>> .
>>>>
>>>> Would that work for you? If not, why not?
>>>>
>>>> On Wed, Nov 23, 2016 at 2:48 AM, kant kodali <kanth...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Looks like all from_json functions will require me to pass schema and
>>>>> that can be little tricky for us but the code below doesn't require me to
>>>>> pass schema at all.
>>>>>
>>>>> import org.apache.spark.sql._
>>>>> val rdd = df2.rdd.map { case Row(j: String) => j }
>>>>> spark.read.json(rdd).show()
>>>>>
>>>>>
>>>>> On Tue, Nov 22, 2016 at 2:42 PM, Michael Armbrust <
>>>>> mich...@databricks.com> wrote:
>>>>>
>>>>>> The first release candidate should be coming out this week. You can
>>>>>> subscribe to the dev list if you want to follow the release schedule.
>>>>>>
>>>>>> On Mon, Nov 21, 2016 at 9:34 PM, kant kodali <kanth...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Michael,
>>>>>>>
>>>>>>> I only see spark 2.0.2 which is what I am using currently. Any idea
>>>>>>> on when 2.1 will be released?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> kant
>>>>>>>
>>>>>>> On Mon, Nov 21, 2016 at 5:12 PM, Michael Armbrust <
>>>>>>> mich...@databricks.com> wrote:
>>>>>>>
>>>>>>>> In Spark 2.1 we've added a from_json
>>>>>>>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2902>
>>>>>>>> function that I think will do what you want.
>>>>>>>>
>>>>>>>> On Fri, Nov 18, 2016 at 2:29 AM, kant kodali <kanth...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This seem to work
>>>>>>>>>
>>>>>>>>> import org.apache.spark.sql._
>>>>>>>>> val rdd = df2.rdd.map { case Row(j: String) => j }
>>>>>>>>> spark.read.json(rdd).show()
>>>>>>>>>
>>>>>>>>> However I wonder if this any inefficiency here ? since I have to
>>>>>>>>> apply this function for billion rows.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-12-05 Thread Hyukjin Kwon
Hi Kant,

How about doing something like this?

import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF and as[String] outside the spark-shell

// val df2 = df.select(df("body").cast(StringType).as("body"))
val df2 = Seq("""{"a": 1}""").toDF("body")
// Infer the schema from the JSON strings once, then apply it with from_json.
val schema = spark.read.json(df2.as[String].rdd).schema
df2.select(from_json(col("body"), schema)).show()

​

2016-12-05 19:51 GMT+09:00 kant kodali :

> Hi Michael,
>
> " Personally, I usually take a small sample of data and use schema
> inference on that.  I then hardcode that schema into my program.  This
> makes your spark jobs much faster and removes the possibility of the schema
> changing underneath the covers."
>
> This may or may not work for us. Not all rows have the same schema. The
> number of distinct schemas we have now may be small but going forward this
> can go to any number moreover a distinct call can lead to a table scan
> which can be billions of rows for us.
>
> I also would agree to keep the API consistent than making an exception
> however I wonder if it make sense to provide an action call to infer the
> schema which would return a new dataframe after the action call finishes
> (after schema inference)? For example, something like below ?
>
> val inferedDF = df.inferSchema(col1);
>
> Thanks,
>
>
>
>
> On Mon, Nov 28, 2016 at 6:12 PM, Michael Armbrust 
> wrote:
>
>> You could open up a JIRA to add a version of from_json that supports
>> schema inference, but unfortunately that would not be super easy to
>> implement.  In particular, it would introduce a weird case where only this
>> specific function would block for a long time while we infer the schema
>> (instead of waiting for an action).  This blocking would be kind of odd for
>> a call like df.select(...).  If there is enough interest, though, we
>> should still do it.
>>
>> To give a little more detail, your version of the code is actually doing
>> two passes over the data: one to infer the schema and a second for whatever
>> processing you are asking it to do.  We have to know the schema at each
>> step of DataFrame construction, so we'd have to do this even before you
>> called an action.
>>
>> Personally, I usually take a small sample of data and use schema
>> inference on that.  I then hardcode that schema into my program.  This
>> makes your spark jobs much faster and removes the possibility of the schema
>> changing underneath the covers.
>>
>> Here's some code I use to build the static schema code automatically
>> 
>> .
>>
>> Would that work for you? If not, why not?
>>
>> On Wed, Nov 23, 2016 at 2:48 AM, kant kodali  wrote:
>>
>>> Hi Michael,
>>>
>>> Looks like all from_json functions will require me to pass schema and
>>> that can be little tricky for us but the code below doesn't require me to
>>> pass schema at all.
>>>
>>> import org.apache.spark.sql._
>>> val rdd = df2.rdd.map { case Row(j: String) => j }
>>> spark.read.json(rdd).show()
>>>
>>>
>>> On Tue, Nov 22, 2016 at 2:42 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 The first release candidate should be coming out this week. You can
 subscribe to the dev list if you want to follow the release schedule.

 On Mon, Nov 21, 2016 at 9:34 PM, kant kodali 
 wrote:

> Hi Michael,
>
> I only see spark 2.0.2 which is what I am using currently. Any idea on
> when 2.1 will be released?
>
> Thanks,
> kant
>
> On Mon, Nov 21, 2016 at 5:12 PM, Michael Armbrust <
> mich...@databricks.com> wrote:
>
>> In Spark 2.1 we've added a from_json
>> 
>> function that I think will do what you want.
>>
>> On Fri, Nov 18, 2016 at 2:29 AM, kant kodali 
>> wrote:
>>
>>> This seem to work
>>>
>>> import org.apache.spark.sql._
>>> val rdd = df2.rdd.map { case Row(j: String) => j }
>>> spark.read.json(rdd).show()
>>>
>>> However I wonder if this any inefficiency here ? since I have to
>>> apply this function for billion rows.
>>>
>>>
>>
>

>>>
>>
>


Re: Handling windows characters with Spark CSV on Linux

2016-11-17 Thread Hyukjin Kwon
Actually, the CSV datasource supports an encoding option[1] (although it does
not support encodings that are not ASCII-compatible).

[1]
https://github.com/apache/spark/blob/44c8bfda793b7655e2bd1da5e9915a09ed9d42ce/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L364
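
For example, a minimal sketch for a Windows-1252 encoded file (the charset and path are assumptions about your data):

// cp1252 is ASCII-compatible, so it falls within what the option supports.
val df = spark.read
  .option("header", "true")
  .option("encoding", "cp1252")
  .csv("/staging/windows_export.csv")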

On 17 Nov 2016 10:59 p.m., "ayan guha"  wrote:

> There is an utility called dos2unix. You can give it a try
>
> On 18 Nov 2016 00:20, "Jörn Franke"  wrote:
> >
> > You can do the conversion of character set (is this the issue?) as part
> of your loading process in Spark.
> > As far as i know the spark CSV package is based on Hadoop
> TextFileInputformat. This format to my best of knowledge supports only
> utf-8. So you have to do a conversion from windows to utf-8. If you refer
> to language specific settings (numbers, dates etc) - this is also not
> supported.
> >
> > I started to work on the hadoopoffice library (which you can use with
> Spark) where you can read Excel files directly (
> https://github.com/ZuInnoTe/hadoopoffice).However, there is no official
> release - yet. There you can specify also the language in which you want to
> represent data values, numbers etc. when reading the file.
> >
> > On 17 Nov 2016, at 14:11, Mich Talebzadeh 
> wrote:
> >
> >> Hi,
> >>
> >> In the past with Databricks package for csv files on occasions I had to
> do some cleaning at Linux directory level before ingesting CSV file into
> HDFS staging directory for Spark to read it.
> >>
> >> I have a more generic issue that may have to be ready.
> >>
> >> Assume that a provides using FTP to push CSV files into Windows
> directories. The whole solution is built around windows and .NET.
> >>
> >> Now you want to ingest those files into HDFS and process them with
> Spark CSV.
> >>
> >> One can create NFS directories visible to Windows server and HDFS
> as well. However, there may be issues with character sets etc. What are the
> best ways of handling this? One way would be to use some scripts to make
> these spreadsheet time files compatible with Linux and then load them into
> HDFS. For example I know that if I saved a Excel spresheet file with DOS
> FORMAT, that file will work OK with Spark CSV.  Are there tools to do this
> as well?
> >>
> >> Thanks
> >>
> >>
> >> Dr Mich Talebzadeh
> >>
> >>
> >>
> >> LinkedIn  https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>
> >>
> >>
> >> http://talebzadehmich.wordpress.com
> >>
> >>
> >> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
> >>
> >>
>


Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-16 Thread Hyukjin Kwon
It seems a bit weird. Could we open an issue and talk in the repository
link I sent?

Let me try to reproduce your case with your data if possible.

On 17 Nov 2016 2:26 a.m., "Arun Patel" <arunp.bigd...@gmail.com> wrote:

> I tried below options.
>
> 1) Increase executor memory.  Increased up to maximum possibility 14GB.
> Same error.
> 2) Tried new version - spark-xml_2.10:0.4.1.  Same error.
> 3) Tried with low level rowTags.  It worked for lower level rowTag and
> returned 16000 rows.
>
> Are there any workarounds for this issue?  I tried playing with 
> spark.memory.fraction
> and spark.memory.storageFraction.  But, it did not help.  Appreciate your
> help on this!!!
>
>
>
> On Tue, Nov 15, 2016 at 8:44 PM, Arun Patel <arunp.bigd...@gmail.com>
> wrote:
>
>> Thanks for the quick response.
>>
>> Its a single XML file and I am using a top level rowTag.  So, it creates
>> only one row in a Dataframe with 5 columns. One of these columns will
>> contain most of the data as StructType.  Is there a limitation to store
>> data in a cell of a Dataframe?
>>
>> I will check with new version and try to use different rowTags and
>> increase executor-memory tomorrow. I will open a new issue as well.
>>
>>
>>
>> On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon <gurwls...@gmail.com>
>> wrote:
>>
>>> Hi Arun,
>>>
>>>
>>> I have few questions.
>>>
>>> Does your XML file have a few huge documents? If a single row is very
>>> large (say 500MB), it would consume a lot of memory, because at least one
>>> whole row has to be held in memory to iterate over it, if I remember
>>> correctly. I remember this happening to me before while processing a huge
>>> record for test purposes.
>>>
>>>
>>> How about trying to increase --executor-memory?
>>>
>>>
>>> Also, could you try selecting only a few fields to prune the data with the
>>> latest version, just to be doubly sure, if you don't mind?
>>>
>>>
>>> Lastly, would you mind opening an issue at
>>> https://github.com/databricks/spark-xml/issues if you still face this
>>> problem?
>>>
>>> I will try to take a look at my best.
>>>
>>>
>>> Thank you.
>>>
>>>
>>> 2016-11-16 9:12 GMT+09:00 Arun Patel <arunp.bigd...@gmail.com>:
>>>
>>>> I am trying to read an XML file which is 1GB is size.  I am getting an
>>>> error 'java.lang.OutOfMemoryError: Requested array size exceeds VM
>>>> limit' after reading 7 partitions in local mode.  In Yarn mode, it
>>>> throws 'java.lang.OutOfMemoryError: Java heap space' error after
>>>> reading 3 partitions.
>>>>
>>>> Any suggestion?
>>>>
>>>> PySpark Shell Command:pyspark --master local[4] --driver-memory 3G
>>>> --jars / tmp/spark-xml_2.10-0.3.3.jar
>>>>
>>>>
>>>>
>>>> Dataframe Creation Command:   df = sqlContext.read.format('com.da
>>>> tabricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
>>>>
>>>>
>>>>
>>>> 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0
>>>> (TID 1) in 25978 ms on localhost (1/10)
>>>>
>>>> 16/11/15 18:27:04 INFO NewHadoopRDD: Input split:
>>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
>>>>
>>>> 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID
>>>> 2). 2309 bytes result sent to driver
>>>>
>>>> 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0
>>>> (TID 3, localhost, partition 3,ANY, 2266 bytes)
>>>>
>>>> 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
>>>>
>>>> 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0
>>>> (TID 2) in 51001 ms on localhost (2/10)
>>>>
>>>> 16/11/15 18:27:55 INFO NewHadoopRDD: Input split:
>>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
>>>>
>>>> 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID
>>>> 3). 2309 bytes result sent to driver
>>>>
>>>> 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0
>>>> (TID 4, localhost, partition 4,ANY, 2266 bytes)
>>>>
>>>> 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread Hyukjin Kwon
It sounds like you may be looking for the from_json/to_json functions, after
decoding the blob properly.
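
For example, a rough sketch (from_json is available from 2.1; the schema and the assumption that the blob decodes to a UTF-8 JSON string are mine, not taken from your table, and df is assumed to be the result of the SELECT in your message):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Decode the blob to a JSON string, then parse it with an explicit schema.
val schema = new StructType().add("a", IntegerType)
val decoded = df.withColumn("body_str", col("json_encoded_blob_column").cast("string"))
decoded.select(from_json(col("body_str"), schema).as("parsed")).show()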

On 16 Nov 2016 6:45 p.m., "kant kodali"  wrote:

>
>
> https://spark.apache.org/docs/2.0.2/sql-programming-guide.
> html#json-datasets
>
> "Spark SQL can automatically infer the schema of a JSON dataset and load
> it as a DataFrame. This conversion can be done using
> SQLContext.read.json() on either an RDD of String, or a JSON file."
>
> val df = spark.sql("SELECT json_encoded_blob_column from table_name"); //
> A cassandra query (cassandra stores blobs in hexadecimal )   
> json_encoded_blob_column
> is encoded in hexadecimal. It will be great to have these blobs interpreted
> and be loaded as a data frame but for now is there anyway to load or parse
> json_encoded_blob_column into a data frame?
>
>
>
>
>
>


Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Hyukjin Kwon
Hi Arun,


I have few questions.

Does your XML file have a few huge documents? If a single row is very large
(say 500MB), it would consume a lot of memory, because at least one whole row
has to be held in memory to iterate over it, if I remember correctly. I
remember this happening to me before while processing a huge record for test
purposes.


How about trying to increase --executor-memory?


Also, could you try selecting only a few fields to prune the data with the
latest version, just to be doubly sure, if you don't mind?


Lastly, would you mind opening an issue at
https://github.com/databricks/spark-xml/issues if you still face this
problem?

I will try to take a look at my best.


Thank you.


2016-11-16 9:12 GMT+09:00 Arun Patel :

> I am trying to read an XML file which is 1GB is size.  I am getting an
> error 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit'
> after reading 7 partitions in local mode.  In Yarn mode, it
> throws 'java.lang.OutOfMemoryError: Java heap space' error after reading
> 3 partitions.
>
> Any suggestion?
>
> PySpark Shell Command:pyspark --master local[4] --driver-memory 3G
> --jars / tmp/spark-xml_2.10-0.3.3.jar
>
>
>
> Dataframe Creation Command:   df = sqlContext.read.format('com.da
> tabricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
>
>
>
> 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID
> 1) in 25978 ms on localhost (1/10)
>
> 16/11/15 18:27:04 INFO NewHadoopRDD: Input split:
> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
>
> 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2).
> 2309 bytes result sent to driver
>
> 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID
> 3, localhost, partition 3,ANY, 2266 bytes)
>
> 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
>
> 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID
> 2) in 51001 ms on localhost (2/10)
>
> 16/11/15 18:27:55 INFO NewHadoopRDD: Input split:
> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
>
> 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3).
> 2309 bytes result sent to driver
>
> 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID
> 4, localhost, partition 4,ANY, 2266 bytes)
>
> 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
>
> 16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID
> 3) in 24336 ms on localhost (3/10)
>
> 16/11/15 18:28:19 INFO NewHadoopRDD: Input split:
> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728
>
> 16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4).
> 2309 bytes result sent to driver
>
> 16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID
> 5, localhost, partition 5,ANY, 2266 bytes)
>
> 16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
>
> 16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID
> 4) in 20895 ms on localhost (4/10)
>
> 16/11/15 18:28:40 INFO NewHadoopRDD: Input split:
> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728
>
> 16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5).
> 2309 bytes result sent to driver
>
> 16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID
> 6, localhost, partition 6,ANY, 2266 bytes)
>
> 16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
>
> 16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID
> 5) in 20793 ms on localhost (5/10)
>
> 16/11/15 18:29:01 INFO NewHadoopRDD: Input split:
> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:805306368+134217728
>
> 16/11/15 18:29:22 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6).
> 2309 bytes result sent to driver
>
> 16/11/15 18:29:22 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID
> 7, localhost, partition 7,ANY, 2266 bytes)
>
> 16/11/15 18:29:22 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
>
> 16/11/15 18:29:22 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID
> 6) in 21306 ms on localhost (6/10)
>
> 16/11/15 18:29:22 INFO NewHadoopRDD: Input split:
> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:939524096+134217728
>
> 16/11/15 18:29:43 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7).
> 2309 bytes result sent to driver
>
> 16/11/15 18:29:43 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID
> 8, localhost, partition 8,ANY, 2266 bytes)
>
> 16/11/15 18:29:43 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
>
> 16/11/15 18:29:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID
> 7) in 21130 ms on localhost (7/10)
>
> 16/11/15 18:29:43 INFO NewHadoopRDD: Input split:
> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:1073741824+134217728
>
> 16/11/15 18:29:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
> 0)
>
> 

Re: How to read a Multi Line json object via Spark

2016-11-15 Thread Hyukjin Kwon
Hi Sree,


There is a blog about that,
http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/

It is pretty old but I am sure that it is helpful.

Currently, the JSON datasource only supports reading JSON documents formatted
with one record per line, as described at http://jsonlines.org/

There is an issue open to support this
https://issues.apache.org/jira/browse/SPARK-18352

I hope this is helpful.
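
Until then, a common workaround sketch is to read each file as a whole document and pass the resulting strings to the JSON reader:

// wholeTextFiles yields (path, content) pairs; keep only the content.
val whole = spark.sparkContext.wholeTextFiles("/user/spark/test.json").values
val df = spark.read.json(whole)
df.printSchema()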


Thanks.



2016-11-15 16:20 GMT+09:00 Sree Eedupuganti :

> I tried from Spark-Shell and i am getting the following error:
>
> Here is the test.json file:
>
> {
> "colorsArray": [{
> "red": "#f00",
> "green": "#0f0",
> "blue": "#00f",
> "cyan": "#0ff",
> "magenta": "#f0f",
> "yellow": "#ff0",
> "black": "#000"
> }]}
>
>
> scala> val jtex = 
> sqlContext.read.format("json").option("samplingRatio","1.0").load("/user/spark/test.json")
>
>jtex: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
>
>
> Any suggestions please. Thanks.
> --
> Best Regards,
> Sreeharsha Eedupuganti
> Data Engineer
> innData Analytics Private Limited
>


Re: Spark SQL shell hangs

2016-11-13 Thread Hyukjin Kwon
Hi Rakesh,

Could you please open an issue at https://github.com/databricks/spark-xml
with some code so that reviewers can reproduce the issue you hit?

Thanks!


2016-11-14 0:20 GMT+09:00 rakesh sharma :

> Hi
>
> I'm trying to convert an XML file to a data frame using Databricks spark-xml,
> but the shell hangs when I do a select operation on the table. I believe it's
> a memory issue. How can I debug this? The file is 86 MB.
>
> Thanks in advance
> Rakesh
>
> Get Outlook for Android 
>


Re: pyspark: accept unicode column names in DataFrame.corr and cov

2016-11-12 Thread Hyukjin Kwon
Hi Sam,

I think I have some answers for two questions.

> Humble request: could we replace the "isinstance(col1, str)" tests with
"isinstance(col1, basestring)"?

IMHO, yes, I believe this should be basestring. Otherwise, some functions
would not accept unicode as arguments for columns in Python 2.7.

> Less humble request: why test types at all? Why not just do one of {raise
KeyError, coerce to string}?

I believe argument type checking is pretty common in other Python libraries
too such as numpy.
ValueError might be more appropriate because the type of the value is not
correct.
Also, I think forcing it into a string might confuse users.

If the current way is problematic and not coherent, I guess you should change
this, but I think it is okay as it is.

Thanks.

2016-11-12 9:36 GMT+09:00 SamPenrose :

> The corr() and cov() methods of DataFrame require an instance of str for
> column names:
>
> .
> https://github.com/apache/spark/blob/master/python/
> pyspark/sql/dataframe.py#L1356
>
> although instances of basestring appear to work for addressing columns:
>
> .
> https://github.com/apache/spark/blob/master/python/
> pyspark/sql/dataframe.py#L708
>
> Humble request: could we replace the "isinstance(col1, str)" tests with
> "isinstance(col1, basestring)"?
>
> Less humble request: why test types at all? Why not just do one of {raise
> KeyError, coerce to string}?
>
> Cheers,
> Sam
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/pyspark-accept-unicode-column-names-
> in-DataFrame-corr-and-cov-tp28065.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Reading csv files with quoted fields containing embedded commas

2016-11-06 Thread Hyukjin Kwon
Hi Femi,

Have you tried the quote-related options specified in the
documentation?

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
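
In particular, for fields like "32 YIU ""A""" in your sample, the escape character needs to be set to a double quote (the default escape is a backslash). A sketch in Scala; the same options apply in PySpark:

val df = spark.read
  .option("header", "true")
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("malformed_data.csv")
df.select("col12", "col13", "col14").show(false)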

Thanks.

2016-11-06 6:58 GMT+09:00 Femi Anthony :

> Hi, I am trying to process a very large comma delimited csv file and I am
> running into problems.
> The main problem is that some fields contain quoted strings with embedded
> commas.
> It seems as if PySpark is unable to properly parse lines containing such
> fields like say Pandas does.
>
> Here is the code I am using to read the file in Pyspark
>
> df_raw=spark.read.option("header","true").csv(csv_path)
>
> Here is an example of a good and 'bad' line in such a file:
>
>
> col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,
> col12,col13,col14,col15,col16,col17,col18,col19
> 80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY
> ""W""   JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,
> cyclingstats,2012-25-19,432,2023-05-17,CODERED
> 6167229561918,137.12,U,8234971771,,,woodstock,,,T4,,,
> OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,
> 2019-11-23,CODEBLUE
>
> Line 0 is the header
> Line 1 is the 'problematic' line
> Line 2 is a good line.
>
> Pandas can handle this easily:
>
>
> [1]: import pandas as pd
>
> In [2]: pdf = pd.read_csv('malformed_data.csv')
>
> In [4]: pdf[['col12','col13','col14']]
> Out[4]:
> col12
> col13  \
> 0  32 XIY "W"   JK, RE LK  SOMETHINGLIKEAPHENOMENON#
> YOUGOTSOUL~BRINGDANOISE
> 1 NaN OUTKAST#THROOTS~WUTANG#RUNDMC
>
>col14
> 0   23.0
> 10.0
>
>
> while Pyspark seems to parse this erroneously:
>
> [5]: sdf=spark.read.format("org.apache.spark.csv").csv('
> malformed_data.csv',header=True)
>
> [6]: sdf.select("col12","col13",'col14').show()
> +--+++
> | col12|   col13|   col14|
> +--+++
> |"32 XIY ""W""   JK|  RE LK"|SOMETHINGLIKEAPHE...|
> |  null|OUTKAST#THROOTS~W...| 0.0|
> +--+++
>
>  Is this a bug or am I doing something wrong ?
>  I am working with Spark 2.0
>  Any help is appreciated
>
> Thanks,
> -- Femi
>
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>


Re: Error creating SparkSession, in IntelliJ

2016-11-03 Thread Hyukjin Kwon
Hi Shyla,

there is documentation for setting up an IDE here:
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup

I hope this is helpful.


2016-11-04 9:10 GMT+09:00 shyla deshpande :

> Hello Everyone,
>
> I just installed Spark 2.0.1, spark shell works fine.
>
> Was able to run some simple programs from the Spark Shell, but find it
> hard to make the same program work when using IntelliJ.
>  I am getting the following error.
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> scala.Predef$.$scope()Lscala/xml/TopScope$;
> at org.apache.spark.ui.jobs.StagePage.(StagePage.scala:44)
> at org.apache.spark.ui.jobs.StagesTab.(StagesTab.scala:34)
> at org.apache.spark.ui.SparkUI.(SparkUI.scala:62)
> at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:215)
> at org.apache.spark.ui.SparkUI$.createLiveUI(SparkUI.scala:157)
> at org.apache.spark.SparkContext.(SparkContext.scala:440)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2275)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$
> 8.apply(SparkSession.scala:831)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$
> 8.apply(SparkSession.scala:823)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.
> scala:823)
> at SparkSessionTest.DataSetWordCount$.main(DataSetWordCount.scala:15)
> at SparkSessionTest.DataSetWordCount.main(DataSetWordCount.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
>
> Thanks
> -Shyla
>
>
>


Re: Spark XML ignore namespaces

2016-11-03 Thread Hyukjin Kwon
Oh, that PR was actually about not handling the namespaces (meaning leaving
the data as it is, including the prefixes).


The problem was that each partition needs to produce each record while
knowing the namespaces.

It is fine to deal with them if they are declared within each XML document
(represented as a row in the DataFrame), but it becomes problematic if they
are declared in the parent of each XML document.


There is an issue open for this,
https://github.com/databricks/spark-xml/issues/74

It'd be nicer to have an option to enable/disable this once we can properly
support namespace handling.


We might be able to talk more there.



2016-11-04 6:37 GMT+09:00 Arun Patel :

> I see that 'ignoring namespaces' issue is resolved.
>
> https://github.com/databricks/spark-xml/pull/75
>
> How do we enable this option and ignore namespace prefixes?
>
> - Arun
>


Re: csv date/timestamp type inference in spark 2.0.1

2016-10-26 Thread Hyukjin Kwon
Hi Koert,


I am curious about your case. I guess the purpose of timestampFormat and
dateFormat is to control how timestamps/dates are inferred and parsed, not
to exclude them from type inference/parsing. Actually, it does try to
infer/parse them in 2.0.0 as well (but it fails), so I guess there wouldn't
be a big performance difference.


Since this is type inference, I guess it is the right behaviour that it
tries to do its best to infer the appropriate type inclusively.

Why don't you just cast the timestamps to strings?
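
For example, in PySpark syntax just for illustration ("ts" and "data.csv"
are placeholders for your inferred column and file):

from pyspark.sql.functions import col

# Let the reader infer the type, then cast the column back to a string.
df = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .csv("data.csv")
df = df.withColumn("ts", col("ts").cast("string"))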


Thanks.


2016-10-27 9:47 GMT+09:00 Koert Kuipers :

> i tried setting both dateFormat and timestampFormat to impossible values
> (e.g. "~|.G~z~a|wW") and it still detected my data to be TimestampType
>
> On Wed, Oct 26, 2016 at 1:15 PM, Koert Kuipers  wrote:
>
>> we had the inference of dates/timestamps when reading csv files disabled
>> in spark 2.0.0 by always setting dateFormat to something impossible (e.g.
>> dateFormat "~|.G~z~a|wW")
>>
>> i noticed in spark 2.0.1 that setting this impossible dateFormat does not
>> stop spark from inferring it is a date or timestamp type anyhow. is this
>> intentional? how do i disable inference of datetype/timestamp type now?
>>
>> thanks! koert
>>
>>
>


Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
Hi Koert,


Sorry, I thought you meant this was a regression between 2.0.0 and 2.0.1. I
just checked, and inferring DateType has never been supported [1].

Yes, it currently only supports inferring such data as timestamps.


[1]
https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L85-L92
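
If you need DateType in the meantime, specifying the schema explicitly
should work. A rough sketch in PySpark syntax (the column name and file
name are just taken from your example below):

from pyspark.sql.types import StructType, StructField, DateType

# Declare the column as DateType up front instead of relying on inference;
# dateFormat tells the parser how the values are formatted.
schema = StructType([StructField("date", DateType(), True)])
df = spark.read \
    .option("header", "true") \
    .option("dateFormat", "yyyy-MM-dd") \
    .schema(schema) \
    .csv("test.csv")
df.printSchema()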




2016-10-27 9:12 GMT+09:00 Anand Viswanathan <anand_v...@ymail.com>:

> Hi,
>
> you can use the customSchema(for DateType) and specify dateFormat in
> .option().
> or
> at spark dataframe side, you can convert the timestamp to date using cast
> to the column.
>
> Thanks and regards,
> Anand Viswanathan
>
> On Oct 26, 2016, at 8:07 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> hey,
> i create a file called test.csv with contents:
> date
> 2015-01-01
> 2016-03-05
>
> next i run this code in spark 2.0.1:
> spark.read
>   .format("csv")
>   .option("header", true)
>   .option("inferSchema", true)
>   .load("test.csv")
>   .printSchema
>
> the result is:
> root
>  |-- date: timestamp (nullable = true)
>
>
> On Wed, Oct 26, 2016 at 7:35 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> There are now timestampFormat for TimestampType and dateFormat for
>> DateType.
>>
>> Do you mind if I ask to share your codes?
>>
>> On 27 Oct 2016 2:16 a.m., "Koert Kuipers" <ko...@tresata.com> wrote:
>>
>>> is there a reason a column with dates in format yyyy-mm-dd in a csv file
>>> is inferred to be TimestampType and not DateType?
>>>
>>> thanks! koert
>>>
>>
>
>


Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
There are now timestampFormat for TimestampType and dateFormat for DateType.

Do you mind if I ask you to share your code?

On 27 Oct 2016 2:16 a.m., "Koert Kuipers"  wrote:

> is there a reason a column with dates in format yyyy-mm-dd in a csv file
> is inferred to be TimestampType and not DateType?
>
> thanks! koert
>


Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Hyukjin Kwon
I am also interested in this issue. I will try to look into this too within
the coming few days.

2016-10-24 21:32 GMT+09:00 Sean Owen :

> I actually think this is a general problem with usage of DateFormat and
> SimpleDateFormat across the code, in that it relies on the default locale
> of the JVM. I believe this needs to, at least, default consistently to
> Locale.US so that behavior is consistent; otherwise it's possible that
> parsing and formatting of dates could work subtly differently across
> environments.
>
> There's a similar question about some code that formats dates for the UI.
> It's more reasonable to let that use the platform-default locale, but, I'd
> still favor standardizing it I think.
>
> Anyway, let me test it out a bit and possibly open a JIRA with this change
> for discussion.
>
> On Mon, Oct 24, 2016 at 1:03 PM pietrop  wrote:
>
> Hi there,
> I opened a question on StackOverflow at this link:
> http://stackoverflow.com/questions/40007972/pyspark-doesnt-recognize-mmm-
> dateformat-pattern-in-spark-read-load-for-dates?
> noredirect=1#comment67297930_40007972
>
> I didn’t get any useful answer, so I’m writing here hoping that someone can
> help me.
>
> In short, I’m trying to read a CSV containing data columns stored using the
> pattern “yyyyMMMdd”. What doesn’t work for me is “MMM”. I’ve done some
> testing and discovered that it’s a localization issue. As you can read from
> the StackOverflow question, I run a simple Java code to parse the date
> “1989Dec31” and it works only if I specify Locale.US in the
> SimpleDateFormat() function.
>
> I would like pyspark to work. I tried setting a different local from
> console
> (LANG=“en_US”), but it doesn’t work. I tried also setting it using the
> locale package from Python.
>
> So, there’s a way to set locale in Spark when using pyspark? The issue is
> Java related and not Python related (the function that parses data is
> invoked by spark.read.load(dateFormat=“yyyyMMMdd”, …). I don’t want to use
> other solutions in order to encode data because they are slower (from what
> I’ve seen so far).
>
> Thank you
> Pietro
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/pyspark-doesn-t-recognize-MMM-
> dateFormat-pattern-in-spark-read-load-for-dates-like-
> 1989Dec31-and-31D9-tp27951.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: how to extract arraytype data to file

2016-10-18 Thread Hyukjin Kwon
This reminds me of
https://github.com/databricks/spark-xml/issues/141#issuecomment-234835577

Maybe using explode() would be helpful.
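
For example (a sketch, assuming your loaded DataFrame is called df and has
the bizs schema you showed, with bid/code fields):

from pyspark.sql.functions import explode

# One output row per element of the "bizs" array; the struct fields can
# then be selected as ordinary columns.
flattened = df.select(explode(df["bizs"]).alias("biz")) \
              .select("biz.bid", "biz.code")
flattened.show()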

Thanks!

2016-10-19 14:05 GMT+09:00 Divya Gehlot :

> http://stackoverflow.com/questions/33864389/how-can-i-
> create-a-spark-dataframe-from-a-nested-array-of-struct-element
>
> Hope this helps
>
>
> Thanks,
> Divya
>
> On 19 October 2016 at 11:35, lk_spark  wrote:
>
>> hi,all:
>> I want to read a json file and search it by sql .
>> the data struct should be :
>>
>> bid: string (nullable = true)
>> code: string (nullable = true)
>>
>> and the json file data should be like :
>>  {bid":"MzI4MTI5MzcyNw==","code":"罗甸网警"}
>>  {"bid":"MzI3MzQ5Nzc2Nw==","code":"西早君"}
>> but in fact my json file data is :
>> {"bizs":[ {bid":"MzI4MTI5MzcyNw==","code
>> ":"罗甸网警"},{"bid":"MzI3MzQ5Nzc2Nw==","code":"西早君"}]}
>> {"bizs":[ {bid":"MzI4MTI5Mzcy00==","code
>> ":"罗甸网警"},{"bid":"MzI3MzQ5Nzc201==","code":"西早君"}]}
>> I load it by spark ,data schema shows like this :
>>
>> root
>>  |-- bizs: array (nullable = true)
>>  ||-- element: struct (containsNull = true)
>>  |||-- bid: string (nullable = true)
>>  |||-- code: string (nullable = true)
>>
>>
>> I can select columns by : df.select("bizs.id","bizs.name")
>> but the colume values is in array type:
>> +++
>> |  id|code|
>> +++
>> |[4938200, 4938201...|[罗甸网警, 室内设计师杨焰红, ...|
>> |[4938300, 4938301...|[SDCS十全九美, 旅梦长大, ...|
>> |[4938400, 4938401...|[日重重工液压行走回转, 氧老家,...|
>> |[4938500, 4938501...|[PABXSLZ, 陈少燕, 笑蜜...|
>> |[4938600, 4938601...|[税海微云, 西域美农云家店, 福...|
>> +++
>>
>> what I want is I can read colum in normal row type. how I can do it ?
>> 2016-10-19
>> --
>> lk_spark
>>
>
>


Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-18 Thread Hyukjin Kwon
Regarding his recent PR [1], I guess he meant multi-line JSON.

As far as I know, single-line JSON also complies with the standard. I left a
comment with the RFC in the PR, but please let me know if I am wrong at any
point.

Thanks!

[1]https://github.com/apache/spark/pull/15511

On 19 Oct 2016 7:00 a.m., "Daniel Barclay" 
wrote:

> Koert,
>
> Koert Kuipers wrote:
>
> A single json object would mean for most parsers it needs to fit in memory
> when reading or writing
>
> Note that codlife didn't seem to being asking about *single-object* JSON
> files, but about *standard-format* JSON files.
>
>
> On Oct 15, 2016 11:09, "codlife" <1004910...@qq.com> wrote:
>
>> Hi:
>>I'm doubt about the design of spark.read.json,  why the json file is
>> not
>> a standard json file, who can tell me the internal reason. Any advice is
>> appreciated.
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Why-the-json-file-used-by-sparkSession
>> -read-json-must-be-a-valid-json-object-per-line-tp27907.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: JSON Arrays and Spark

2016-10-12 Thread Hyukjin Kwon
No, I meant each record should be on a single line, but an array type is
also supported as a root wrapper of JSON objects.

If you need to parse multiple lines, I have a reference here.

http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/
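
If it helps, one way to do it (a rough sketch in PySpark, not exactly the
code from that post; it assumes one JSON document per input file and a
placeholder path):

import json

# Each file becomes a (path, content) pair; re-serialising the parsed
# document guarantees the JSON reader sees a single-line record per file.
multiline_rdd = sc.wholeTextFiles("/path/to/json/dir") \
    .map(lambda kv: json.dumps(json.loads(kv[1])))
df = spark.read.json(multiline_rdd)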

2016-10-12 15:04 GMT+09:00 Kappaganthu, Sivaram (ES) <
sivaram.kappagan...@adp.com>:

> Hi,
>
>
>
> Does this mean that handling any Json with kind of below schema  with
> spark is not a good fit?? I have requirement to parse the below Json that
> spans across multiple lines. Whats the best way to parse the jsns of this
> kind?? Please suggest.
>
>
>
> root
>
> |-- maindate: struct (nullable = true)
>
> ||-- mainidnId: string (nullable = true)
>
> |-- Entity: array (nullable = true)
>
> ||-- element: struct (containsNull = true)
>
> |||-- Profile: struct (nullable = true)
>
> ||||-- Kind: string (nullable = true)
>
> |||-- Identifier: string (nullable = true)
>
> |||-- Group: array (nullable = true)
>
> ||||-- element: struct (containsNull = true)
>
> |||||-- Period: struct (nullable = true)
>
> ||||||-- pid: string (nullable = true)
>
> ||||||-- pDate: string (nullable = true)
>
> ||||||-- quarter: long (nullable = true)
>
> ||||||-- labour: array (nullable = true)
>
> |||||||-- element: struct (containsNull = true)
>
> ||||||||-- category: string (nullable = true)
>
> ||||||||-- id: string (nullable = true)
>
> ||||||||-- person: struct (nullable = true)
>
> |||||||||-- address: array (nullable =
> true)
>
> ||||||||||-- element: struct
> (containsNull = true)
>
> |||||||||||-- city: string
> (nullable = true)
>
> |||||||||||-- line1: string
> (nullable = true)
>
> |||||||||||-- line2: string
> (nullable = true)
>
> |||||||||||-- postalCode: string
> (nullable = true)
>
> |||||||||||-- state: string
> (nullable = true)
>
> |||||||||||-- type: string
> (nullable = true)
>
> |||||||||-- familyName: string (nullable =
> true)
>
> ||||||||-- tax: array (nullable = true)
>
> |||||||||-- element: struct (containsNull
> = true)
>
> ||||||||||-- code: string (nullable =
> true)
>
> ||||||||||-- qwage: double (nullable =
> true)
>
> ||||||||||-- qvalue: double (nullable
> = true)
>
> ||||||||||-- qSubjectvalue: double
> (nullable = true)
>
> ||||||||||-- qfinalvalue: double
> (nullable = true)
>
> ||||||||||-- ywage: double (nullable =
> true)
>
> ||||||||||-- yalue: double (nullable =
> true)
>
> ||||||||||-- ySubjectvalue: double
> (nullable = true)
>
> ||||||||||-- yfinalvalue: double
> (nullable = true)
>
> ||||||||-- tProfile: array (nullable = true)
>
> |||||||||-- element: struct (containsNull
> = true)
>
> ||||||||||-- isExempt: boolean
> (nullable = true)
>
> ||||||||||-- jurisdiction: struct
> (nullable = true)
>
> |||||||||||-- code: string
> (nullable = true)
>
> ||||||||||-- maritalStatus: string
> (nullable = true)
>
> ||||||||||-- numberOfDeductions: long
> (nullable = true)
>
> ||||||||-- wDate: struct (nullable = true)
>
> |||||||||-- originalHireDate: string
> (nullable = true)
>
> ||||||-- year: long (nullable = true)
>
>
>
>
>
> *From:* Luciano Resende [mailto:luckbr1...@gmail.com]
> *Sent:* Monday, October 10, 2016 11:39 PM
> *To:* Jean Georges Perrin
> *Cc:* user @spark
> *Subject:* Re: JSON Arrays and Spark
>
>
>
> Please take a look at
> http://spark.apache.org/docs/latest/sql-programming-guide.
> html#json-datasets
>
> Particularly the note at the required format :
>
> Note that the file that is offered as *a json file* is not a typical JSON
> file. Each line must contain a separate, self-contained valid JSON object.
> As a consequence, a regular multi-line JSON file will most often fail.
>
>
>
> On Mon, Oct 10, 2016 at 9:57 AM, Jean Georges Perrin 

Re: JSON Arrays and Spark

2016-10-10 Thread Hyukjin Kwon
FYI, it supports

[{...}, {...} ...]

Or

{...}

format as input.
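
For instance, both of these parse fine as long as each sits on a single
line (a small sketch in PySpark, just for illustration):

# Both single-line forms below parse into rows; the array form yields
# one row per element.
single_object = '{"vals": [100, 500, 600]}'
array_of_objects = '[{"val": 100}, {"val": 500}]'

df1 = spark.read.json(sc.parallelize([single_object]))
df2 = spark.read.json(sc.parallelize([array_of_objects]))
df1.printSchema()
df2.show()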

On 11 Oct 2016 3:19 a.m., "Jean Georges Perrin"  wrote:

> Thanks Luciano - I think this is my issue :(
>
> On Oct 10, 2016, at 2:08 PM, Luciano Resende  wrote:
>
> Please take a look at
> http://spark.apache.org/docs/latest/sql-programming-guide.
> html#json-datasets
>
> Particularly the note at the required format :
>
> Note that the file that is offered as *a json file* is not a typical JSON
> file. Each line must contain a separate, self-contained valid JSON object.
> As a consequence, a regular multi-line JSON file will most often fail.
>
>
>
> On Mon, Oct 10, 2016 at 9:57 AM, Jean Georges Perrin  wrote:
>
>> Hi folks,
>>
>> I am trying to parse JSON arrays and it’s getting a little crazy (for me
>> at least)…
>>
>> 1)
>> If my JSON is:
>> {"vals":[100,500,600,700,800,200,900,300]}
>>
>> I get:
>> ++
>> |vals|
>> ++
>> |[100, 500, 600, 7...|
>> ++
>>
>> root
>>  |-- vals: array (nullable = true)
>>  ||-- element: long (containsNull = true)
>>
>> and I am :)
>>
>> 2)
>> If my JSON is:
>> [100,500,600,700,800,200,900,300]
>>
>> I get:
>> ++
>> | _corrupt_record|
>> ++
>> |[100,500,600,700,...|
>> ++
>>
>> root
>>  |-- _corrupt_record: string (nullable = true)
>>
>> Both are legit JSON structures… Do you think that #2 is a bug?
>>
>> jg
>>
>>
>>
>>
>>
>>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>
>
>


Re: pyspark: sqlContext.read.text() does not work with a list of paths

2016-10-06 Thread Hyukjin Kwon
It seems obviously a bug. It was introduced by my PR,
https://github.com/apache/spark/commit/d37c7f7f042f7943b5b684e53cf4284c601fb347

+1 for creating a JIRA and a PR. If that is a problem for you, I would
like to do it quickly.
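
In the meantime, a workaround is to read the paths one at a time and union
the results (a sketch; `paths` stands for your list of path strings):

from functools import reduce

# Avoid passing the list directly until the fix is in; union the
# per-path DataFrames instead.
dfs = [sqlContext.read.text(p) for p in paths]
df = reduce(lambda left, right: left.union(right), dfs)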


On 5 Oct 2016 9:12 p.m., "Laurent Legrand"  wrote:

> Hello,
>
> When I try to load multiple text files with the sqlContext, I get the
> following error:
>
> spark-2.0.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/
> sql/readwriter.py",
> line 282, in text
> UnboundLocalError: local variable 'path' referenced before assignment
>
> According to the code
> (https://github.com/apache/spark/blob/master/python/pyspark/
> sql/readwriter.py#L291),
> the variable 'path' is not set if the argument is not a string.
>
> Could you confirm it is a bug?
>
> Regards,
> Laurent
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/pyspark-sqlContext-read-text-does-not-
> work-with-a-list-of-paths-tp27838.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: spark sql on json

2016-09-29 Thread Hyukjin Kwon
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java#L104-L181

2016-09-29 18:58 GMT+09:00 Hitesh Goyal :

> Hi team,
>
>
>
> I have a json document. I want to put spark SQL to it.
>
> Can you please send me an example app built in JAVA so that I would be
> able to put spark sql queries on my data.
>
>
>
> Regards,
>
> *Hitesh Goyal*
>
> Simpli5d Technologies
>
> Cont No.: 9996588220
>
>
>


Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Hyukjin Kwon
Hi Mich,

I guess you could use the nullValue option so that those values become null.
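
For example, in PySpark syntax just for illustration (the same options
exist on the Scala reader; the schema below mirrors your case class, and
`location` stands for your path):

from pyspark.sql.types import (StructType, StructField, StringType,
                               FloatType, IntegerType)

# Read "-" as null and fix the column types up front so the price
# columns come back as floats rather than strings.
schema = StructType([
    StructField("Stock", StringType(), True),
    StructField("Ticker", StringType(), True),
    StructField("TradeDate", StringType(), True),
    StructField("Open", FloatType(), True),
    StructField("High", FloatType(), True),
    StructField("Low", FloatType(), True),
    StructField("Close", FloatType(), True),
    StructField("Volume", IntegerType(), True),
])

df1 = spark.read \
    .option("header", "true") \
    .option("nullValue", "-") \
    .schema(schema) \
    .csv(location)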

If you are reading them in as strings in the first place, then you would
hit https://github.com/apache/spark/pull/14118 first, which is resolved
as of 2.0.1.

Unfortunately, this bug also exists in the external CSV library for strings,
if I recall correctly.

However, it'd be fine if you set the schema explicitly when you load, as
this bug does not exist for floats at least.

I hope this is helpful.

Thanks!

On 28 Sep 2016 7:06 a.m., "Mich Talebzadeh" 
wrote:

> Thanks guys
>
> Actually these are the 7 rogue rows. The column 0 is the Volume column
> which means there was no trades on those days
>
>
> *cat stock.csv|grep ",0"*SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
> SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
> SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
> SAP SE,SAP, 23-Dec-10,-,-,-,38.36,0
> SAP SE,SAP, 30-Apr-08,-,-,-,32.39,0
> SAP SE,SAP, 29-Apr-08,-,-,-,33.05,0
> SAP SE,SAP, 28-Apr-08,-,-,-,32.60,0
>
> So one way would be to exclude the rows that there was no volume of trade
> that day when cleaning up the csv file
>
> *cat stock.csv|grep -v **",0"*
>
> and that works. Bearing in mind that putting 0s in place of "-" will skew
> the price plot.
>
> BTW I am using Spark csv as well
>
> val df1 = spark.read.option("header", true).csv(location)
>
> This is the class and the mapping
>
>
> case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
> Float, High: Float, Low: Float, Close: Float, Volume: Integer)
> val df2 = df1.map(p => columns(p(0).toString, p(1).toString,
> p(2).toString, p(3).toString.toFloat, p(4).toString.toFloat,
> p(5).toString.toFloat, p(6).toString.toFloat, p(7).toString.toInt))
>
>
> In here I have
>
> p(3).toString.toFloat
>
> How can one check for rogue data in p(3)?
>
>
> Thanks
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 September 2016 at 21:49, Mich Talebzadeh 
> wrote:
>
>>
>> I have historical prices for various stocks.
>>
>> Each csv file has 10 years trade one row per each day.
>>
>> These are the columns defined in the class
>>
>> case class columns(Stock: String, Ticker: String, TradeDate: String,
>> Open: Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>>
>> The issue is with Open, High, Low, Close columns that all are defined as
>> Float.
>>
>> Most rows are OK like below but the red one with "-" defined as Float
>> causes issues
>>
>>   Date Open High  Low   Close Volume
>> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
>> 23-Dec-11   - --40.56 0
>>
>> Because the prices are defined as Float, these rows cause the application
>> to crash
>> scala> val rs = df2.filter(changeToDate("TradeDate") >=
>> monthsago).select((changeToDate("TradeDate").as("TradeDate")
>> ),(('Close+'Open)/2).as("AverageDailyPrice"), 'Low.as("Day's Low"),
>> 'High.as("Day's High")).orderBy("TradeDate").collect
>> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0
>> (TID 260)
>> java.lang.NumberFormatException: For input string: "-"
>>
>>
>> One way is to define the prices as Strings but that is not
>> meaningful. Alternatively do the clean up before putting csv in HDFS but
>> that becomes tedious and error prone.
>>
>> Any ideas will be appreciated.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>


Re: NumberFormatException: For input string: "0.00000"

2016-09-19 Thread Hyukjin Kwon
It seems this is not an issue in Spark. Does "CSVParser" work fine on the
data without Spark?

BTW, it seems there is something wrong with your email address. I am
sending this again.

On 20 Sep 2016 8:32 a.m., "Hyukjin Kwon" <gurwls...@gmail.com> wrote:

> It seems not an issue in Spark. Does "CSVParser" works fine without Spark
> with the data?
>
> On 20 Sep 2016 2:15 a.m., "Mohamed ismail" <mismai...@yahoo.com.invalid>
> wrote:
>
>> Hi all
>>
>> I am trying to read:
>>
>> sc.textFile(DataFile).mapPartitions(lines => {
>> val parser = new CSVParser(",")
>> lines.map(line=>parseLineToTuple(line, parser))
>> })
>> Data looks like:
>> android phone,0,0,0,,0,0,0,0,0,0,0,5,0,0,0,5,0,0.0,0.0,0.00000,0.0,0.0,0,0,0,0,0,0,0,0.0,0,0,0
>> ios phone,0,-1,0,,0,0,0,0,0,0,1,0,0,0,0,1,0,0.0,0.0,0.00000,0.0,0.0,0,0,0,0,0,0,0,0.0,0,0,0
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 1 in stage 23055.0 failed 4 times, most recent failure: Lost task 1.3 in
>> stage 23055.0 (TID 191607, ):
>> java.lang.NumberFormatException: For input string: "0.00000"
>>
>> Has anyone faced such issues. Is there a solution?
>>
>> Thanks,
>> Mohamed
>>
>>


Re: NumberFormatException: For input string: "0.00000"

2016-09-19 Thread Hyukjin Kwon
It seems this is not an issue in Spark. Does "CSVParser" work fine on the
data without Spark?

On 20 Sep 2016 2:15 a.m., "Mohamed ismail" 
wrote:

> Hi all
>
> I am trying to read:
>
> sc.textFile(DataFile).mapPartitions(lines => {
> val parser = new CSVParser(",")
> lines.map(line=>parseLineToTuple(line, parser))
> })
> Data looks like:
> android phone,0,0,0,,0,0,0,0,0,0,0,5,0,0,0,5,0,0.0,0.0,0.00000,0.0,0.0,0,0,0,0,0,0,0,0.0,0,0,0
> ios phone,0,-1,0,,0,0,0,0,0,0,1,0,0,0,0,1,0,0.0,0.0,0.00000,0.0,0.0,0,0,0,0,0,0,0,0.0,0,0,0
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in stage 23055.0 failed 4 times, most recent failure: Lost task 1.3 in
> stage 23055.0 (TID 191607, ):
> java.lang.NumberFormatException: For input string: "0.00000"
>
> Has anyone faced such issues. Is there a solution?
>
> Thanks,
> Mohamed
>
>


How many are there PySpark Windows users?

2016-09-18 Thread Hyukjin Kwon
Hi all,

We are currently testing SparkR on Windows [1], and several problems are
being identified from time to time. Although it does not seem easy to
automate Spark's Scala tests on Windows (I think we would need proper change
detection so that only the related tests run rather than the whole suite),
I think it is feasible to automate the PySpark tests on Windows because they
do not take so long.

I believe there are quite a few data analysts using SparkR on Windows, but I
am not sure how many people are using PySpark on Windows.

Currently, there are some issues with the SparkR testing automation (e.g.
triggering builds/tests on PRs not related to SparkR), so we might have to
try this later. I just want to know and make sure before opening a future PR
and looking into this further, and then I would like to add this thread as a
reference.

I would appreciate it if any of you could share your opinions/experiences or
any issues with PySpark on Windows.

Thank you very much.

[1] https://github.com/apache/spark/pull/14859


Re: take() works on RDD but .write.json() does not work in 2.0.0

2016-09-17 Thread Hyukjin Kwon
Hi Kevin,

I have a few questions on this.

Does it fail only with write.json()? I just wonder whether write.text,
write.csv, or another API fails as well, or whether it is a JSON-specific issue.

Also, does it work with small data? I want to make sure whether this happens
only with large data.

Thanks!



2016-09-18 6:42 GMT+09:00 Kevin Burton :

> I'm seeing some weird behavior and wanted some feedback.
>
> I have a fairly large, multi-hour job that operates over about 5TB of data.
>
> It builds it out into a ranked category index of about 25000 categories
> sorted by rank, descending.
>
> I want to write this to a file but it's not actually writing any data.
>
> if I run myrdd.take(100) ... that works fine and prints data to a file.
>
> If I run
>
> myrdd.write.json(), it takes the same amount of time, and then writes a
> local file with a SUCCESS file but no actual partition data in the file.
> There's only one small file with SUCCESS.
>
> Any advice on how to debug this?
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


Re: Spark CSV skip lines

2016-09-10 Thread Hyukjin Kwon
As you are reading each file as a single record via wholeTextFiles and
flattening it into records, I think you can just drop the few lines you
want.

Can you just drop or skip those few lines from reader.readAll() before the map(...)?

Also, are you sure this is an issue in Spark rather than in the external CSV library?

Do you mind if I ask for the stack trace if you think so?

On 11 Sep 2016 1:50 a.m., "Selvam Raman" <sel...@gmail.com> wrote:

> Hi,
>
> I saw this two option already anyway thanks for the idea.
>
> i am using wholetext file to read my data(cause there are  \n middle of
> it) and using opencsv to parse the data. In my data first two lines are
> just some report. how can i eliminate.
>
> *How to eliminate first two lines after reading from wholetextfiles.*
>
> val test = wholeTextFiles.flatMap{ case (_, txt) =>
>  | val reader = new CSVReader(new StringReader(txt));
>  | reader.readAll().map(data => Row(data(3),data(4),data(7),
> data(9),data(14)))}
>
> The above code throws arrayoutofbounce exception for empty line and report
> line.
>
>
> On Sat, Sep 10, 2016 at 3:02 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Hi Selvam,
>>
>> If your report is commented with any character (e.g. #), you can skip
>> these lines via comment option [1].
>>
>> If you are using Spark 1.x, then you might be able to do this by manually
>> skipping from the RDD and then making this to DataFrame as below:
>>
>> I haven’t tested this but I think this should work.
>>
>> val rdd = sparkContext.textFile("...")
>> val filteredRdd = rdd.mapPartitionsWithIndex { (idx, iter) =>
>>   if (idx == 0) {
>> iter.drop(10)
>>   } else {
>> iter
>>   }
>> }
>> val df = new CsvParser().csvRdd(sqlContext, filteredRdd)
>>
>> If you are using Spark 2.0, then it seems there is no way to manually
>> modifying the source data because loading existing RDD or DataSet[String]
>> to DataFrame is not yet supported.
>>
>> There is an issue open[2]. I hope this is helpful.
>>
>> Thanks.
>>
>> [1] https://github.com/apache/spark/blob/27209252f09ff73c58e
>> 60c6df8aaba73b308088c/sql/core/src/main/scala/org/
>> apache/spark/sql/DataFrameReader.scala#L369
>> [2] https://issues.apache.org/jira/browse/SPARK-15463
>>
>>
>> ​
>>
>>
>> On 10 Sep 2016 6:14 p.m., "Selvam Raman" <sel...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am using spark csv to read csv file. The issue is my files first n
>>> lines contains some report and followed by actual data (header and rest of
>>> the data).
>>>
>>> So how can i skip first n lines in spark csv. I dont have any specific
>>> comment character in the first byte.
>>>
>>> Please give me some idea.
>>>
>>> --
>>> Selvam Raman
>>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>>
>>
>
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Re: Reading a TSV file

2016-09-10 Thread Hyukjin Kwon
Yeap. Also, sep is preferred and has higher precedence than delimiter.

2016-09-11 0:44 GMT+09:00 Jacek Laskowski :

> Hi Muhammad,
>
> sep or delimiter should both work fine.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Sep 10, 2016 at 10:42 AM, Muhammad Asif Abbasi
>  wrote:
> > Thanks for responding. I believe i had already given scala example as a
> part
> > of my code in the second email.
> >
> > Just looked at the DataFrameReader code, and it appears the following
> would
> > work in Java.
> >
> > Dataset pricePaidDS = spark.read().option("sep","\t"
> ).csv(fileName);
> >
> > Thanks for your help.
> >
> > Cheers,
> >
> >
> >
> > On Sat, Sep 10, 2016 at 2:49 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com>
> > wrote:
> >>
> >> Read header false not true
> >>
> >>  val df2 = spark.read.option("header",
> >> false).option("delimiter","\t").csv("hdfs://rhes564:9000/
> tmp/nw_10124772.tsv")
> >>
> >>
> >>
> >> Dr Mich Talebzadeh
> >>
> >>
> >>
> >> LinkedIn
> >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCd
> OABUrV8Pw
> >>
> >>
> >>
> >> http://talebzadehmich.wordpress.com
> >>
> >>
> >> Disclaimer: Use it at your own risk. Any and all responsibility for any
> >> loss, damage or destruction of data or any other property which may
> arise
> >> from relying on this email's technical content is explicitly
> disclaimed. The
> >> author will in no case be liable for any monetary damages arising from
> such
> >> loss, damage or destruction.
> >>
> >>
> >>
> >>
> >> On 10 September 2016 at 14:46, Mich Talebzadeh <
> mich.talebza...@gmail.com>
> >> wrote:
> >>>
> >>> This should be pretty straight forward?
> >>>
> >>> You can create a tab separated file from any database table and buck
> copy
> >>> out, MSSQL, Sybase etc
> >>>
> >>>  bcp scratchpad..nw_10124772 out nw_10124772.tsv -c -t '\t' -Usa
> -A16384
> >>> Password:
> >>> Starting copy...
> >>> 441 rows copied.
> >>>
> >>> more nw_10124772.tsv
> >>> Mar 22 2011 12:00:00:000AM  SBT 602424  10124772FUNDS
> >>> TRANSFER , FROM A/C 17904064  200.00  200.00
> >>> Mar 22 2011 12:00:00:000AM  SBT 602424  10124772FUNDS
> >>> TRANSFER , FROM A/C 36226823  454.74  654.74
> >>>
> >>> Put that file into hdfs. Note that it has no headers
> >>>
> >>> Read in as a tsv file
> >>>
> >>> scala> val df2 = spark.read.option("header",
> >>> true).option("delimiter","\t").csv("hdfs://rhes564:9000/tmp/
> nw_10124772.tsv")
> >>> df2: org.apache.spark.sql.DataFrame = [Mar 22 2011 12:00:00:000AM:
> >>> string, SBT: string ... 6 more fields]
> >>>
> >>> scala> df2.first
> >>> res7: org.apache.spark.sql.Row = [Mar 22 2011
> >>> 12:00:00:000AM,SBT,602424,10124772,FUNDS TRANSFER , FROM A/C
> >>> 17904064,200.00,,200.00]
> >>>
> >>> HTH
> >>>
> >>>
> >>> Dr Mich Talebzadeh
> >>>
> >>>
> >>>
> >>> LinkedIn
> >>> https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>
> >>>
> >>>
> >>> http://talebzadehmich.wordpress.com
> >>>
> >>>
> >>> Disclaimer: Use it at your own risk. Any and all responsibility for any
> >>> loss, damage or destruction of data or any other property which may
> arise
> >>> from relying on this email's technical content is explicitly
> disclaimed. The
> >>> author will in no case be liable for any monetary damages arising from
> such
> >>> loss, damage or destruction.
> >>>
> >>>
> >>>
> >>>
> >>> On 10 September 2016 at 13:57, Mich Talebzadeh
> >>>  wrote:
> 
>  Thanks Jacek.
> 
>  The old stuff with databricks
> 
>  scala> val df =
>  spark.read.format("com.databricks.spark.csv").option("inferSchema",
>  "true").option("header",
>  "true").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
>  df: org.apache.spark.sql.DataFrame = [Transaction Date: string,
>  Transaction Type: string ... 7 more fields]
> 
>  Now I can do
> 
>  scala> val df2 = spark.read.option("header",
>  true).csv("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
>  df2: org.apache.spark.sql.DataFrame = [Transaction Date: string,
>  Transaction Type: string ... 7 more fields]
> 
>  About Schema stuff that apparently Spark works out itself
> 
>  scala> df.printSchema
>  root
>   |-- Transaction Date: string (nullable = true)
>   |-- Transaction Type: string (nullable = true)
>   |-- Sort Code: string (nullable = true)
>   |-- Account Number: integer (nullable = true)
>   |-- Transaction Description: string (nullable = true)
>   |-- Debit Amount: double (nullable = true)
>   |-- Credit Amount: double (nullable = true)
>   |-- Balance: double (nullable = true)
>   |-- _c8: string (nullable = true)
> 
>  
