Re: Why spark-submit works with package not with jar
Are you sure com.google.api.client.http.HttpRequestInitializer is in spark-bigquery-latest.jar, or might it only come in as a transitive dependency of spark-bigquery_2.11? On Sat, May 4, 2024 at 7:43 PM Mich Talebzadeh wrote: > > Mich Talebzadeh, > Technologist | Architect | Data Engineer | Generative AI | FinCrime > London > United Kingdom > > >view my Linkedin profile > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > > > https://en.everybodywiki.com/Mich_Talebzadeh > > > > *Disclaimer:* The information provided is correct to the best of my > knowledge but of course cannot be guaranteed. It is essential to note > that, as with any advice, quote "one test result is worth one-thousand > expert opinions (Wernher Von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". > > > ---------- Forwarded message ---------- > From: Mich Talebzadeh > Date: Tue, 20 Oct 2020 at 16:50 > Subject: Why spark-submit works with package not with jar > To: user @spark > > > Hi, > > I have a scenario that I use in Spark submit as follows: > > spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars > /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar, > */home/hduser/jars/spark-bigquery_2.11-0.2.6.jar* > > As you can see, the jar files needed are added. > > > This comes back with the error message below > > > Creating model test.weights_MODEL > > java.lang.NoClassDefFoundError: > com/google/api/client/http/HttpRequestInitializer > > at > com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19) > > at > com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19) > > at > com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105) > > ... 76 elided > > Caused by: java.lang.ClassNotFoundException: > com.google.api.client.http.HttpRequestInitializer > > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > > > > So there is an issue with finding the class, although the jar file used > > > /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar > > has it. > > > Now if *I remove the above jar file and replace it with the same version > but as a package* it works! > > > spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars > /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar > *--packages com.github.samelamin:spark-bigquery_2.11:0.2.6* > > > I have read the write-ups about packages searching the maven > libraries etc. I am not convinced why using the package should make so much > difference between a failure and a success. In other words, when should one use a > package rather than a jar? > > > Any ideas will be appreciated. > > > Thanks > > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > -- Best Regards Jeff Zhang
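One way to check where the class is (or is not) visible is a small classpath probe, a minimal sketch to run from spark-shell launched with the same --driver-class-path / --jars / --packages arguments as the failing job. The class name comes from the stack trace above, and sc is the SparkContext that spark-shell provides; nothing else is assumed.

val className = "com.google.api.client.http.HttpRequestInitializer"

// Is the class visible on the driver?
val onDriver = scala.util.Try(Class.forName(className)).isSuccess

// Is it visible inside the executors? Run a small job that tries the same lookup there.
val onExecutors = sc.parallelize(1 to 100, 4)
  .map(_ => scala.util.Try(Class.forName(className)).isSuccess)
  .collect()
  .forall(identity)

println(s"driver sees class: $onDriver, all executors see class: $onExecutors")

If the driver sees the class but the executors do not (or vice versa), that usually points at the difference between --driver-class-path, --jars and --packages: --packages resolves the named artifact plus its transitive dependencies from Maven repositories, while --jars only distributes exactly the files listed, so any missing transitive dependency (here, most likely the Google HTTP client jar) would have to be added by hand.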
Re: Welcoming some new committers and PMC members
Congratulations! Saisai Shao 于2019年9月10日周二 上午9:16写道: > Congratulations! > > Jungtaek Lim 于2019年9月9日周一 下午6:11写道: > >> Congratulations! Well deserved! >> >> On Tue, Sep 10, 2019 at 9:51 AM John Zhuge wrote: >> >>> Congratulations! >>> >>> On Mon, Sep 9, 2019 at 5:45 PM Shane Knapp wrote: >>> >>>> congrats everyone! :) >>>> >>>> On Mon, Sep 9, 2019 at 5:32 PM Matei Zaharia >>>> wrote: >>>> > >>>> > Hi all, >>>> > >>>> > The Spark PMC recently voted to add several new committers and one >>>> PMC member. Join me in welcoming them to their new roles! >>>> > >>>> > New PMC member: Dongjoon Hyun >>>> > >>>> > New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming >>>> Wang, Weichen Xu, Ruifeng Zheng >>>> > >>>> > The new committers cover lots of important areas including ML, SQL, >>>> and data sources, so it’s great to have them here. All the best, >>>> > >>>> > Matei and the Spark PMC >>>> > >>>> > >>>> > - >>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>> > >>>> >>>> >>>> -- >>>> Shane Knapp >>>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>>> https://rise.cs.berkeley.edu >>>> >>>> - >>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>> >>>> >>> >>> -- >>> John Zhuge >>> >> >> >> -- >> Name : Jungtaek Lim >> Blog : http://medium.com/@heartsavior >> Twitter : http://twitter.com/heartsavior >> LinkedIn : http://www.linkedin.com/in/heartsavior >> > -- Best Regards Jeff Zhang
Re: [ANNOUNCE] Announcing Apache Spark 2.2.3
Congrats, Great work Dongjoon. Dongjoon Hyun 于2019年1月15日周二 下午3:47写道: > We are happy to announce the availability of Spark 2.2.3! > > Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2 > maintenance branch of Spark. We strongly recommend all 2.2.x users to > upgrade to this stable release. > > To download Spark 2.2.3, head over to the download page: > http://spark.apache.org/downloads.html > > To view the release notes: > https://spark.apache.org/releases/spark-release-2-2-3.html > > We would like to acknowledge all community members for contributing to > this release. This release would not have been possible without you. > > Bests, > Dongjoon. > -- Best Regards Jeff Zhang
Re: Kubernetes backend and docker images
Awesome, less is better Mridul Muralidharan 于2018年1月6日周六 上午11:54写道: > > We should definitely clean this up and make it the default, nicely done > Marcelo ! > > Thanks, > Mridul > > On Fri, Jan 5, 2018 at 5:06 PM Marcelo Vanzin wrote: > >> Hey all, especially those working on the k8s stuff. >> >> Currently we have 3 docker images that need to be built and provided >> by the user when starting a Spark app: driver, executor, and init >> container. >> >> When the initial review went by, I asked why do we need 3, and I was >> told that's because they have different entry points. That never >> really convinced me, but well, everybody wanted to get things in to >> get the ball rolling. >> >> But I still think that's not the best way to go. I did some pretty >> simple hacking and got things to work with a single image: >> >> https://github.com/vanzin/spark/commit/k8s-img >> >> Is there a reason why that approach would not work? You could still >> create separate images for driver and executor if wanted, but there's >> no reason I can see why we should need 3 images for the simple case. >> >> Note that the code there can be cleaned up still, and I don't love the >> idea of using env variables to propagate arguments to the container, >> but that works for now. >> >> -- >> Marcelo >> >> - >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >>
Re: Faster Spark on ORC with Apache ORC
Awesome, Dong Joon, It's a great improvement. Looking forward its merge. Dong Joon Hyun 于2017年7月12日周三 上午6:53写道: > Hi, All. > > > > Since Apache Spark 2.2 vote passed successfully last week, > > I think it’s a good time for me to ask your opinions again about the > following PR. > > > > https://github.com/apache/spark/pull/17980 (+3,887, −86) > > > > It’s for the following issues. > > > >- SPARK-20728: Make ORCFileFormat configurable between sql/hive and >sql/core >- SPARK-20682: Support a new faster ORC data source based on Apache ORC > > > > Basically, the approach is trying to use the latest Apache ORC 1.4.0 > officially. > > You can switch between the legacy ORC data source and new ORC datasource. > > > > Could you help me to progress this in order to improve Apache Spark 2.3? > > > > Bests, > > Dongjoon. > > > > *From: *Dong Joon Hyun > > > *Date: *Tuesday, May 9, 2017 at 6:15 PM > *To: *"dev@spark.apache.org" > *Subject: *Faster Spark on ORC with Apache ORC > > > > Hi, All. > > > > Apache Spark always has been a fast and general engine, and > > since SPARK-2883, Spark supports Apache ORC inside `sql/hive` module with > Hive dependency. > > > > With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC > faster and get some benefits. > > > > - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together > which means full vectorization support. > > > > - Stability: Apache ORC 1.4.0 already has many fixes and we can depend > on ORC community effort in the future. > > > > - Usability: Users can use `ORC` data sources without hive module > (-Phive) > > > > - Maintainability: Reduce the Hive dependency and eventually remove > some old legacy code from `sql/hive` module. > > > > As a first step, I made a PR adding a new ORC data source into `sql/core` > module. > > > > https://github.com/apache/spark/pull/17924 (+ 3,691 lines, -0) > > > > Could you give some opinions on this approach? > > > > Bests, > > Dongjoon. >
Re: [Important for PySpark Devs]: Master now tests with Python 2.7 rather than 2.6 - please retest any Python PRs
Thanks, I retriggered several pyspark PRs. Hyukjin Kwon wrote on Thu, Mar 30, 2017 at 7:42 AM: > Thank you for informing this. > > On 30 Mar 2017 3:52 a.m., "Holden Karau" wrote: > > Hi PySpark Developers, > > In https://issues.apache.org/jira/browse/SPARK-19955 / > https://github.com/apache/spark/pull/17355, as part of our continued > Python 2.6 deprecation https://issues.apache.org/jira/browse/SPARK-15902 > & eventual removal https://issues.apache.org/jira/browse/SPARK-12661 , > Jenkins master will now test with Python 2.7 rather than Python 2.6. If you > have a pending Python PR please re-run Jenkins tests prior to merge to > avoid issues. > > For your local testing *make sure you have a version of Python 2.7 > installed on your machine *otherwise it will default to using the python > executable and in the future you may run into compatibility issues. > > Note: this only impacts master and has not been merged to other branches, > so if you want to make fixes that are planned to be backported to 2.1, > please continue to use 2.6 compatible Python code (and note you can always > explicitly set a python version to be run with the --python-executables > flag when testing locally). > > Cheers, > > Holden :) > > P.S. > > If you run into any issues around this please feel free (as always) to > reach out and ping me. > > -- > Cell : 425-233-8271 > Twitter: https://twitter.com/holdenkarau > >
Re: welcoming Burak and Holden as committers
Congratulations Burak and Holden! Yanbo Liang 于2017年1月25日周三 上午11:54写道: > Congratulations, Burak and Holden. > > On Tue, Jan 24, 2017 at 7:32 PM, Chester Chen > wrote: > > Congratulation to both. > > > > Holden, we need catch up. > > > > > > *Chester Chen * > > ■ Senior Manager – Data Science & Engineering > > 3000 Clearview Way > > San Mateo, CA 94402 > > > > > > *From: *Felix Cheung > *Date: *Tuesday, January 24, 2017 at 1:20 PM > *To: *Reynold Xin , "dev@spark.apache.org" < > dev@spark.apache.org> > *Cc: *Holden Karau , Burak Yavuz < > bu...@databricks.com> > *Subject: *Re: welcoming Burak and Holden as committers > > > > Congrats and welcome!! > > > -- > > *From:* Reynold Xin > *Sent:* Tuesday, January 24, 2017 10:13:16 AM > *To:* dev@spark.apache.org > *Cc:* Burak Yavuz; Holden Karau > *Subject:* welcoming Burak and Holden as committers > > > > Hi all, > > > > Burak and Holden have recently been elected as Apache Spark committers. > > > > Burak has been very active in a large number of areas in Spark, including > linear algebra, stats/maths functions in DataFrames, Python/R APIs for > DataFrames, dstream, and most recently Structured Streaming. > > > > Holden has been a long time Spark contributor and evangelist. She has > written a few books on Spark, as well as frequent contributions to the > Python API to improve its usability and performance. > > > > Please join me in welcoming the two! > > > > > > >
Re: [VOTE] Release Apache Spark 1.6.3 (RC2)
+1 Dongjoon Hyun 于2016年11月4日周五 上午9:44写道: > +1 (non-binding) > > It's built and tested on CentOS 6.8 / OpenJDK 1.8.0_111, too. > > Cheers, > Dongjoon. > > On 2016-11-03 14:30 (-0700), Davies Liu wrote: > > +1 > > > > On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin wrote: > > > Please vote on releasing the following candidate as Apache Spark > version > > > 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes > if a > > > majority of at least 3+1 PMC votes are cast. > > > > > > [ ] +1 Release this package as Apache Spark 1.6.3 > > > [ ] -1 Do not release this package because ... > > > > > > > > > The tag to be voted on is v1.6.3-rc2 > > > (1e860747458d74a4ccbd081103a0542a2367b14b) > > > > > > This release candidate addresses 52 JIRA tickets: > > > https://s.apache.org/spark-1.6.3-jira > > > > > > The release files, including signatures, digests, etc. can be found at: > > > http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-bin/ > > > > > > Release artifacts are signed with the following key: > > > https://people.apache.org/keys/committer/pwendell.asc > > > > > > The staging repository for this release can be found at: > > > > https://repository.apache.org/content/repositories/orgapachespark-1212/ > > > > > > The documentation corresponding to this release can be found at: > > > > http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-docs/ > > > > > > > > > === > > > == How can I help test this release? > > > === > > > If you are a Spark user, you can help us test this release by taking an > > > existing Spark workload and running on this release candidate, then > > > reporting any regressions from 1.6.2. > > > > > > > > > == What justifies a -1 vote for this release? > > > > > > This is a maintenance release in the 1.6.x series. Bugs already > present in > > > 1.6.2, missing features, or bugs related to new features will not > > > necessarily block this release. > > > > > > > - > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > > > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >
Re: [VOTE] Release Apache Spark 2.0.1 (RC4)
+1 On Fri, Sep 30, 2016 at 9:27 AM, Burak Yavuz wrote: > +1 > > On Sep 29, 2016 4:33 PM, "Kyle Kelley" wrote: > >> +1 >> >> On Thu, Sep 29, 2016 at 4:27 PM, Yin Huai wrote: >> >>> +1 >>> >>> On Thu, Sep 29, 2016 at 4:07 PM, Luciano Resende >>> wrote: >>> >>>> +1 (non-binding) >>>> >>>> On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin >>>> wrote: >>>> >>>>> Please vote on releasing the following candidate as Apache Spark >>>>> version 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and >>>>> passes if a majority of at least 3+1 PMC votes are cast. >>>>> >>>>> [ ] +1 Release this package as Apache Spark 2.0.1 >>>>> [ ] -1 Do not release this package because ... >>>>> >>>>> >>>>> The tag to be voted on is v2.0.1-rc4 (933d2c1ea4e5f5c4ec8d375b5ccaa >>>>> 4577ba4be38) >>>>> >>>>> This release candidate resolves 301 issues: >>>>> https://s.apache.org/spark-2.0.1-jira >>>>> >>>>> The release files, including signatures, digests, etc. can be found at: >>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-bin/ >>>>> >>>>> Release artifacts are signed with the following key: >>>>> https://people.apache.org/keys/committer/pwendell.asc >>>>> >>>>> The staging repository for this release can be found at: >>>>> https://repository.apache.org/content/repositories/orgapache >>>>> spark-1203/ >>>>> >>>>> The documentation corresponding to this release can be found at: >>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0. >>>>> 1-rc4-docs/ >>>>> >>>>> >>>>> Q: How can I help test this release? >>>>> A: If you are a Spark user, you can help us test this release by >>>>> taking an existing Spark workload and running on this release candidate, >>>>> then reporting any regressions from 2.0.0. >>>>> >>>>> Q: What justifies a -1 vote for this release? >>>>> A: This is a maintenance release in the 2.0.x series. Bugs already >>>>> present in 2.0.0, missing features, or bugs related to new features will >>>>> not necessarily block this release. >>>>> >>>>> Q: What fix version should I use for patches merging into branch-2.0 >>>>> from now on? >>>>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new >>>>> RC (i.e. RC5) is cut, I will change the fix version of those patches to >>>>> 2.0.1. >>>>> >>>>> >>>>> >>>> >>>> >>>> -- >>>> Luciano Resende >>>> http://twitter.com/lresende1975 >>>> http://lresende.blogspot.com/ >>>> >>> >>> >> >> >> -- >> Kyle Kelley (@rgbkrk <https://twitter.com/rgbkrk>; lambdaops.com) >> > -- Best Regards Jeff Zhang
Re: [VOTE] Release Apache Spark 2.0.1 (RC3)
+1 On Mon, Sep 26, 2016 at 2:03 PM, Shixiong(Ryan) Zhu wrote: > +1 > > On Sun, Sep 25, 2016 at 10:43 PM, Pete Lee wrote: > >> +1 >> >> >> On Sun, Sep 25, 2016 at 3:26 PM, Herman van Hövell tot Westerflier < >> hvanhov...@databricks.com> wrote: >> >>> +1 (non-binding) >>> >>> On Sun, Sep 25, 2016 at 2:05 PM, Ricardo Almeida < >>> ricardo.alme...@actnowib.com> wrote: >>> >>>> +1 (non-binding) >>>> >>>> Built and tested on >>>> - Ubuntu 16.04 / OpenJDK 1.8.0_91 >>>> - CentOS / Oracle Java 1.7.0_55 >>>> (-Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -Pyarn) >>>> >>>> >>>> On 25 September 2016 at 22:35, Matei Zaharia >>>> wrote: >>>> >>>>> +1 >>>>> >>>>> Matei >>>>> >>>>> On Sep 25, 2016, at 1:25 PM, Josh Rosen >>>>> wrote: >>>>> >>>>> +1 >>>>> >>>>> On Sun, Sep 25, 2016 at 1:16 PM Yin Huai wrote: >>>>> >>>>>> +1 >>>>>> >>>>>> On Sun, Sep 25, 2016 at 11:40 AM, Dongjoon Hyun >>>>>> wrote: >>>>>> >>>>>>> +1 (non binding) >>>>>>> >>>>>>> RC3 is compiled and tested on the following two systems, too. All >>>>>>> tests passed. >>>>>>> >>>>>>> * CentOS 7.2 / Oracle JDK 1.8.0_77 / R 3.3.1 >>>>>>>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver >>>>>>> -Dsparkr >>>>>>> * CentOS 7.2 / Open JDK 1.8.0_102 >>>>>>>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver >>>>>>> >>>>>>> Cheers, >>>>>>> Dongjoon >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Saturday, September 24, 2016, Reynold Xin >>>>>>> wrote: >>>>>>> >>>>>>>> Please vote on releasing the following candidate as Apache Spark >>>>>>>> version 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT >>>>>>>> and >>>>>>>> passes if a majority of at least 3+1 PMC votes are cast. >>>>>>>> >>>>>>>> [ ] +1 Release this package as Apache Spark 2.0.1 >>>>>>>> [ ] -1 Do not release this package because ... >>>>>>>> >>>>>>>> >>>>>>>> The tag to be voted on is v2.0.1-rc3 (9d28cc10357a8afcfb2fa2e6eecb5 >>>>>>>> c2cc2730d17) >>>>>>>> >>>>>>>> This release candidate resolves 290 issues: >>>>>>>> https://s.apache.org/spark-2.0.1-jira >>>>>>>> >>>>>>>> The release files, including signatures, digests, etc. can be found >>>>>>>> at: >>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0. >>>>>>>> 1-rc3-bin/ >>>>>>>> >>>>>>>> Release artifacts are signed with the following key: >>>>>>>> https://people.apache.org/keys/committer/pwendell.asc >>>>>>>> >>>>>>>> The staging repository for this release can be found at: >>>>>>>> https://repository.apache.org/content/repositories/orgapache >>>>>>>> spark-1201/ >>>>>>>> >>>>>>>> The documentation corresponding to this release can be found at: >>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0. >>>>>>>> 1-rc3-docs/ >>>>>>>> >>>>>>>> >>>>>>>> Q: How can I help test this release? >>>>>>>> A: If you are a Spark user, you can help us test this release by >>>>>>>> taking an existing Spark workload and running on this release >>>>>>>> candidate, >>>>>>>> then reporting any regressions from 2.0.0. >>>>>>>> >>>>>>>> Q: What justifies a -1 vote for this release? >>>>>>>> A: This is a maintenance release in the 2.0.x series. Bugs already >>>>>>>> present in 2.0.0, missing features, or bugs related to new features >>>>>>>> will >>>>>>>> not necessarily block this release. >>>>>>>> >>>>>>>> Q: What fix version should I use for patches merging into >>>>>>>> branch-2.0 from now on? >>>>>>>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a >>>>>>>> new RC (i.e. RC4) is cut, I will change the fix version of those >>>>>>>> patches to >>>>>>>> 2.0.1. 
>>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>> >>>> >>> >> > -- Best Regards Jeff Zhang
Re: Welcoming Felix Cheung as a committer
Congrats Felix! On Tue, Aug 9, 2016 at 8:49 AM, Hyukjin Kwon wrote: > Congratulations! > > 2016-08-09 7:47 GMT+09:00 Xiao Li : > >> Congrats Felix! >> >> 2016-08-08 15:04 GMT-07:00 Herman van Hövell tot Westerflier >> : >> > Congrats Felix! >> > >> > On Mon, Aug 8, 2016 at 11:57 PM, dhruve ashar >> wrote: >> >> >> >> Congrats Felix! >> >> >> >> On Mon, Aug 8, 2016 at 2:28 PM, Tarun Kumar >> wrote: >> >>> >> >>> Congrats Felix! >> >>> >> >>> Tarun >> >>> >> >>> On Tue, Aug 9, 2016 at 12:57 AM, Timothy Chen >> wrote: >> >>>> >> >>>> Congrats Felix! >> >>>> >> >>>> Tim >> >>>> >> >>>> On Mon, Aug 8, 2016 at 11:15 AM, Matei Zaharia < >> matei.zaha...@gmail.com> >> >>>> wrote: >> >>>> > Hi all, >> >>>> > >> >>>> > The PMC recently voted to add Felix Cheung as a committer. Felix >> has >> >>>> > been a major contributor to SparkR and we're excited to have him >> join >> >>>> > officially. Congrats and welcome, Felix! >> >>>> > >> >>>> > Matei >> >>>> > ---- >> - >> >>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >>>> > >> >>>> >> >>>> >> - >> >>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >>>> >> >> >> >> >> >> >> >> -- >> >> -Dhruve Ashar >> >> >> > >> >> - >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> > -- Best Regards Jeff Zhang
Re: Possible contribution to MLlib
I think it is valuable to make the distance function pluggable and also to provide some built-in distance functions. This might also be useful for other algorithms besides KMeans. On Tue, Jun 21, 2016 at 7:48 PM, Simon NANTY wrote: > Hi all, > > > > In my team, we are currently developing a fork of spark MLlib extending the > K-means method such that it is possible to set one's own distance function. > In this implementation, it could be possible to directly pass, as an argument > of the K-means train function, a distance function whose signature is: > (VectorWithNorm, VectorWithNorm) => Double. > > > > We have found the JIRA issue SPARK-11665 proposing to support new > distances in bisecting K-means. There has also been the JIRA issue > SPARK-3219 proposing to add Bregman divergences as distance functions, but > it has not been added to MLlib. Therefore, we are wondering whether such an > extension of the MLlib K-means algorithm would be appreciated by the community > and would have a chance to get included in future spark releases. > > > > Regards, > > > > Simon Nanty > > > -- Best Regards Jeff Zhang
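For illustration, a minimal sketch of what a pluggable distance function could look like from the caller's side. It uses the public mllib Vector type rather than the internal VectorWithNorm mentioned above, and the names (DistanceFn, euclidean, cosine) are made up for this example; they are not an existing MLlib API.

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical signature for a user-supplied distance function
type DistanceFn = (Vector, Vector) => Double

// Roughly the default K-means behaviour: Euclidean distance
val euclidean: DistanceFn = (a, b) => math.sqrt(Vectors.sqdist(a, b))

// A user-supplied alternative, e.g. cosine distance
val cosine: DistanceFn = (a, b) => {
  val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
  1.0 - dot / (Vectors.norm(a, 2) * Vectors.norm(b, 2))
}

// A pluggable KMeans.train could then accept such a function as an extra argument,
// with euclidean as the built-in default.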
Re: [vote] Apache Spark 2.0.0-preview release (rc1)
t;> [ 4.219 s] >>>>>>> [INFO] Spark Project External Flume ... SUCCESS >>>>>>> [ 6.987 s] >>>>>>> [INFO] Spark Project External Flume Assembly .. SUCCESS >>>>>>> [ 1.465 s] >>>>>>> [INFO] Spark Integration for Kafka 0.8 SUCCESS >>>>>>> [ 6.891 s] >>>>>>> [INFO] Spark Project Examples . SUCCESS >>>>>>> [ 13.465 s] >>>>>>> [INFO] Spark Project External Kafka Assembly .. SUCCESS >>>>>>> [ 2.815 s] >>>>>>> [INFO] >>>>>>> >>>>>>> [INFO] BUILD SUCCESS >>>>>>> [INFO] >>>>>>> >>>>>>> [INFO] Total time: 07:04 min >>>>>>> [INFO] Finished at: 2016-05-18T17:55:33+02:00 >>>>>>> [INFO] Final Memory: 90M/824M >>>>>>> [INFO] >>>>>>> >>>>>>> >>>>>>> On 18 May 2016, at 16:28, Sean Owen wrote: >>>>>>> >>>>>>> I think it's a good idea. Although releases have been preceded before >>>>>>> by release candidates for developers, it would be good to get a >>>>>>> formal >>>>>>> preview/beta release ratified for public consumption ahead of a new >>>>>>> major release. Better to have a little more testing in the wild to >>>>>>> identify problems before 2.0.0 is finalized. >>>>>>> >>>>>>> +1 to the release. License, sigs, etc check out. On Ubuntu 16 + Java >>>>>>> 8, compilation and tests succeed for "-Pyarn -Phive >>>>>>> -Phive-thriftserver -Phadoop-2.6". >>>>>>> >>>>>>> On Wed, May 18, 2016 at 6:40 AM, Reynold Xin >>>>>>> wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> In the past the Apache Spark community have created preview packages >>>>>>> (not >>>>>>> official releases) and used those as opportunities to ask community >>>>>>> members >>>>>>> to test the upcoming versions of Apache Spark. Several people in the >>>>>>> Apache >>>>>>> community have suggested we conduct votes for these preview packages >>>>>>> and >>>>>>> turn them into formal releases by the Apache foundation's standard. >>>>>>> Preview >>>>>>> releases are not meant to be functional, i.e. they can and highly >>>>>>> likely >>>>>>> will contain critical bugs or documentation errors, but we will be >>>>>>> able to >>>>>>> post them to the project's website to get wider feedback. They should >>>>>>> satisfy the legal requirements of Apache's release policy >>>>>>> (http://www.apache.org/dev/release.html) such as having proper >>>>>>> licenses. >>>>>>> >>>>>>> >>>>>>> Please vote on releasing the following candidate as Apache Spark >>>>>>> version >>>>>>> 2.0.0-preview. The vote is open until Friday, May 20, 2015 at 11:00 >>>>>>> PM PDT >>>>>>> and passes if a majority of at least 3 +1 PMC votes are cast. >>>>>>> >>>>>>> [ ] +1 Release this package as Apache Spark 2.0.0-preview >>>>>>> [ ] -1 Do not release this package because ... >>>>>>> >>>>>>> To learn more about Apache Spark, please see >>>>>>> http://spark.apache.org/ >>>>>>> >>>>>>> The tag to be voted on is 2.0.0-preview >>>>>>> (8f5a04b6299e3a47aca13cbb40e72344c0114860) >>>>>>> >>>>>>> The release files, including signatures, digests, etc. 
can be found >>>>>>> at: >>>>>>> >>>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-bin/ >>>>>>> >>>>>>> Release artifacts are signed with the following key: >>>>>>> https://people.apache.org/keys/committer/pwendell.asc >>>>>>> >>>>>>> The documentation corresponding to this release can be found at: >>>>>>> >>>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/ >>>>>>> >>>>>>> The list of resolved issues are: >>>>>>> >>>>>>> https://issues.apache.org/jira/browse/SPARK-15351?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.0.0 >>>>>>> >>>>>>> >>>>>>> If you are a Spark user, you can help us test this release by taking >>>>>>> an >>>>>>> existing Apache Spark workload and running on this candidate, then >>>>>>> reporting >>>>>>> any regressions. >>>>>>> >>>>>>> >>>>>>> - >>>>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >>>>>>> For additional commands, e-mail: dev-h...@spark.apache.org >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >> > -- Best Regards Jeff Zhang
Re: executor delay in Spark
t; When submitting a job with spark-submit, I've observed delays (up to >>>>> 1--2 seconds) for the executors to respond to the driver in order to >>>>> receive tasks in the first stage. The delay does not persist once the >>>>> executors have been synchronized. >>>>> >>>>> When the tasks are very short, as may be your case (relatively small >>>>> data and a simple map task like you have described), the 8 tasks in >>>>> your stage may be allocated to only 1 executor in 2 waves of 4, since >>>>> the second executor won't have responded to the master before the >>>>> first 4 tasks on the first executor have completed. >>>>> >>>>> To see if this is the cause in your particular case, you could try the >>>>> following to confirm: >>>>> 1. Examine the starting times of the tasks alongside their >>>>> executor >>>>> 2. Make a "dummy" stage execute before your real stages to >>>>> synchronize the executors by creating and materializing any random RDD >>>>> 3. Make the tasks longer, i.e. with some silly computational >>>>> work. >>>>> >>>>> Mike >>>>> >>>>> >>>>> On 4/17/16, Raghava Mutharaju wrote: >>>>> > Yes its the same data. >>>>> > >>>>> > 1) The number of partitions are the same (8, which is an argument to >>>>> the >>>>> > HashPartitioner). In the first case, these partitions are spread >>>>> across >>>>> > both the worker nodes. In the second case, all the partitions are on >>>>> the >>>>> > same node. >>>>> > 2) What resources would be of interest here? Scala shell takes the >>>>> default >>>>> > parameters since we use "bin/spark-shell --master " to >>>>> run the >>>>> > scala-shell. For the scala program, we do set some configuration >>>>> options >>>>> > such as driver memory (12GB), parallelism is set to 8 and we use Kryo >>>>> > serializer. >>>>> > >>>>> > We are running this on Azure D3-v2 machines which have 4 cores and >>>>> 14GB >>>>> > RAM.1 executor runs on each worker node. Following configuration >>>>> options >>>>> > are set for the scala program -- perhaps we should move it to the >>>>> spark >>>>> > config file. >>>>> > >>>>> > Driver memory and executor memory are set to 12GB >>>>> > parallelism is set to 8 >>>>> > Kryo serializer is used >>>>> > Number of retainedJobs and retainedStages has been increased to >>>>> check them >>>>> > in the UI. >>>>> > >>>>> > What information regarding Spark Context would be of interest here? >>>>> > >>>>> > Regards, >>>>> > Raghava. >>>>> > >>>>> > On Sun, Apr 17, 2016 at 10:54 PM, Anuj Kumar >>>>> wrote: >>>>> > >>>>> >> If the data file is same then it should have similar distribution of >>>>> >> keys. >>>>> >> Few queries- >>>>> >> >>>>> >> 1. Did you compare the number of partitions in both the cases? >>>>> >> 2. Did you compare the resource allocation for Spark Shell vs Scala >>>>> >> Program being submitted? >>>>> >> >>>>> >> Also, can you please share the details of Spark Context, >>>>> Environment and >>>>> >> Executors when you run via Scala program? >>>>> >> >>>>> >> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju < >>>>> >> m.vijayaragh...@gmail.com> wrote: >>>>> >> >>>>> >>> Hello All, >>>>> >>> >>>>> >>> We are using HashPartitioner in the following way on a 3 node >>>>> cluster (1 >>>>> >>> master and 2 worker nodes). 
>>>>> >>> >>>>> >>> val u = >>>>> >>> sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int, >>>>> >>> Int)](line => { line.split("\\|") match { case Array(x, y) => >>>>> (y.toInt, >>>>> >>> x.toInt) } }).partitionBy(new >>>>> HashPartitioner(8)).setName("u").persist() >>>>> >>> >>>>> >>> u.count() >>>>> >>> >>>>> >>> If we run this from the spark shell, the data (52 MB) is split >>>>> across >>>>> >>> the >>>>> >>> two worker nodes. But if we put this in a scala program and run >>>>> it, then >>>>> >>> all the data goes to only one node. We have run it multiple times, >>>>> but >>>>> >>> this >>>>> >>> behavior does not change. This seems strange. >>>>> >>> >>>>> >>> Is there some problem with the way we use HashPartitioner? >>>>> >>> >>>>> >>> Thanks in advance. >>>>> >>> >>>>> >>> Regards, >>>>> >>> Raghava. >>>>> >>> >>>>> >> >>>>> >> >>>>> > >>>>> > >>>>> > -- >>>>> > Regards, >>>>> > Raghava >>>>> > http://raghavam.github.io >>>>> > >>>>> >>>>> >>>>> -- >>>>> Thanks, >>>>> Mike >>>>> >>>> >>>> >>>> >>>> -- >>>> Regards, >>>> Raghava >>>> http://raghavam.github.io >>>> >>> >> >> >> -- >> Regards, >> Raghava >> http://raghavam.github.io >> > -- Best Regards Jeff Zhang
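A minimal sketch of suggestion 2 from the advice quoted above (synchronize the executors with a throwaway stage before the real work starts). The partition count of 8 matches the HashPartitioner(8) used in the thread and is otherwise arbitrary; sc is the usual SparkContext.

// Dummy stage: materialize a throwaway RDD so that both executors have
// registered with the driver before the real job's tasks are scheduled.
sc.parallelize(1 to 10000, 8).count()

// ...then run the real partitionBy(new HashPartitioner(8)) / count() job from the thread,
// and compare how its tasks are spread across the two workers.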
Spark build with scala-2.10 fails ?
Anyone can pass the spark build with scala-2.10 ? [info] Compiling 475 Scala sources and 78 Java sources to /Users/jzhang/github/spark/core/target/scala-2.10/classes... [error] /Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala:30: object ShuffleServiceHeartbeat is not a member of package org.apache.spark.network.shuffle.protocol.mesos [error] import org.apache.spark.network.shuffle.protocol.mesos.{RegisterDriver, ShuffleServiceHeartbeat} [error]^ [error] /Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala:87: not found: type ShuffleServiceHeartbeat [error] def unapply(h: ShuffleServiceHeartbeat): Option[String] = Some(h.getAppId) [error]^ [error] /Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala:83: value getHeartbeatTimeoutMs is not a member of org.apache.spark.network.shuffle.protocol.mesos.RegisterDriver [error] Some((r.getAppId, new AppState(r.getHeartbeatTimeoutMs, System.nanoTime( [error]^ [error] /Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:451: too many arguments for method registerDriverWithShuffleService: (x$1: String, x$2: Int)Unit [error] .registerDriverWithShuffleService( [error]^ [error] four errors found [error] Compile failed at Mar 17, 2016 2:45:22 PM [13.105s] -- Best Regards Jeff Zhang
Re: What should be spark.local.dir in spark on yarn?
You are using yarn-client mode, where the driver does not run in a YARN container, so it cannot use yarn.nodemanager.local-dirs; it falls back to spark.local.dir, which is /tmp by default. But the driver usually doesn't need much local disk, so using /tmp on the driver side should be fine. On Tue, Mar 1, 2016 at 4:57 PM, Alexander Pivovarov wrote: > spark 1.6.0 uses /tmp in the following places > # spark.local.dir is not set > yarn.nodemanager.local-dirs=/data01/yarn/nm,/data02/yarn/nm > > 1. spark-shell on start > 16/03/01 08:33:48 INFO storage.DiskBlockManager: Created local directory > at /tmp/blockmgr-ffd3143d-b47f-4844-99fd-2d51c6a05d05 > > 2. spark-shell on start > 16/03/01 08:33:50 INFO yarn.Client: Uploading resource > file:/tmp/spark-456184c9-d59f-48f4-a9b0-560b7d310655/__spark_conf__6943938018805427428.zip > -> > hdfs://ip-10-101-124-30:8020/user/hadoop/.sparkStaging/application_1456776184284_0047/__spark_conf__6943938018805427428.zip > > 3. spark-shell spark-sql (Hive) on start > 16/03/01 08:34:06 INFO session.SessionState: Created local directory: > /tmp/01705299-a384-4e85-923b-e858017cf351_resources > 16/03/01 08:34:06 INFO session.SessionState: Created HDFS directory: > /tmp/hive/hadoop/01705299-a384-4e85-923b-e858017cf351 > 16/03/01 08:34:06 INFO session.SessionState: Created local directory: > /tmp/hadoop/01705299-a384-4e85-923b-e858017cf351 > 16/03/01 08:34:06 INFO session.SessionState: Created HDFS directory: > /tmp/hive/hadoop/01705299-a384-4e85-923b-e858017cf351/_tmp_space.db > > 4. Spark Executor container uses hadoop.tmp.dir /data01/tmp/hadoop-${ > user.name} for s3 output > > scala> sc.parallelize(1 to > 10).saveAsTextFile("s3n://my_bucket/test/p10_13"); > > 16/03/01 08:41:13 INFO s3native.NativeS3FileSystem: OutputStream for key > 'test/p10_13/part-0' writing to tempfile > '/data01/tmp/hadoop-hadoop/s3/output-7399167152756918334.tmp' > > > -- > > if I set spark.local.dir=/data01/tmp then #1 and #2 uses /data01/tmp instead > of /tmp > > -- > > > 1. 16/03/01 08:47:03 INFO storage.DiskBlockManager: Created local directory > at /data01/tmp/blockmgr-db88dbd2-0ef4-433a-95ea-b33392bbfb7f > > > 2. 16/03/01 08:47:05 INFO yarn.Client: Uploading resource > file:/data01/tmp/spark-aa3e619c-a368-4f95-bd41-8448a78ae456/__spark_conf__368426817234224667.zip > -> > hdfs://ip-10-101-124-30:8020/user/hadoop/.sparkStaging/application_1456776184284_0050/__spark_conf__368426817234224667.zip > > > 3. spark-sql (hive) still uses /tmp > > 16/03/01 08:47:20 INFO session.SessionState: Created local directory: > /tmp/d315926f-39d7-4dcb-b3fa-60e9976f7197_resources > 16/03/01 08:47:20 INFO session.SessionState: Created HDFS directory: > /tmp/hive/hadoop/d315926f-39d7-4dcb-b3fa-60e9976f7197 > 16/03/01 08:47:20 INFO session.SessionState: Created local directory: > /tmp/hadoop/d315926f-39d7-4dcb-b3fa-60e9976f7197 > 16/03/01 08:47:20 INFO session.SessionState: Created HDFS directory: > /tmp/hive/hadoop/d315926f-39d7-4dcb-b3fa-60e9976f7197/_tmp_space.db > > > 4. executor uses hadoop.tmp.dir for s3 output > > 16/03/01 08:50:01 INFO s3native.NativeS3FileSystem: OutputStream for key > 'test/p10_16/_SUCCESS' writing to tempfile > '/data01/tmp/hadoop-hadoop/s3/output-2541604454681305094.tmp' > > > 5.
/data0X/yarn/nm used for usercache > > 16/03/01 08:41:12 INFO storage.DiskBlockManager: Created local directory at > /data01/yarn/nm/usercache/hadoop/appcache/application_1456776184284_0047/blockmgr-af5 > > > > On Mon, Feb 29, 2016 at 3:44 PM, Jeff Zhang wrote: > >> In yarn mode, spark.local.dir is yarn.nodemanager.local-dirs for shuffle >> data and block manager disk data. What do you mean "But output files to >> upload to s3 still created in /tmp on slaves" ? You should have control on >> where to store your output data if that means your job's output. >> >> On Tue, Mar 1, 2016 at 3:12 AM, Alexander Pivovarov > > wrote: >> >>> I have Spark on yarn >>> >>> I defined yarn.nodemanager.local-dirs to be >>> /data01/yarn/nm,/data02/yarn/nm >>> >>> when I look at yarn executor container log I see that blockmanager files >>> created in /data01/yarn/nm,/data02/yarn/nm >>> >>> But output files to upload to s3 still created in /tmp on slaves >>> >>> I do not want Spark write heavy files to /tmp because /tmp is only 5GB >>> >>> spark slaves have two big additional disks /disk01 and /disk02 attached >>> >>> Probably I can set spark.local.dir to be /data01/tmp,/data02/tmp >>> >>> But spark master also writes some files to spark.local.dir >>> But my master box has only one additional disk /data01 >>> >>> So, what should I use for spark.local.dir the >>> spark.local.dir=/data01/tmp >>> or >>> spark.local.dir=/data01/tmp,/data02/tmp >>> >>> ? >>> >> >> >> >> -- >> Best Regards >> >> Jeff Zhang >> > > -- Best Regards Jeff Zhang
Re: Is spark.driver.maxResultSize used correctly ?
I checked the code again. It looks like the task result is currently loaded into memory on the driver no matter whether it is a DirectTaskResult or an IndirectTaskResult. Previously I thought an IndirectTaskResult could be loaded into memory lazily, which would save memory; something like an RDD#collectAsIterator is what I had in mind for saving memory. On Tue, Mar 1, 2016 at 5:00 PM, Reynold Xin wrote: > How big of a deal is this though? If I am reading your email correctly, > either way this job will fail. You simply want it to fail earlier on the > executor side, rather than collecting it and failing on the driver side? > > > On Sunday, February 28, 2016, Jeff Zhang wrote: > >> Data skew is possible, but it is not the common case. I think we should >> design for the common case; for the skew case, we could add a fraction >> parameter to let users tune it. >> >> On Sat, Feb 27, 2016 at 4:51 PM, Reynold Xin wrote: >> >>> But sometimes you might have skew and almost all the result data are in >>> one or a few tasks though. >>> >>> >>> On Friday, February 26, 2016, Jeff Zhang wrote: >>> >>>> >>>> My job gets this exception very easily even when I set a large value for >>>> spark.driver.maxResultSize. After checking the spark code, I found that >>>> spark.driver.maxResultSize is also used on the executor side to decide whether >>>> a DirectTaskResult or an IndirectTaskResult is sent. This doesn't make sense to me. >>>> Using spark.driver.maxResultSize / taskNum might be more appropriate. Because >>>> if spark.driver.maxResultSize is 1g and we have 10 tasks, each with 200m of >>>> output, then the output of each task is less than >>>> spark.driver.maxResultSize, so a DirectTaskResult will be sent to the driver, but >>>> the total result size is 2g, which will cause an exception on the driver side. >>>> >>>> >>>> 16/02/26 10:10:49 INFO DAGScheduler: Job 4 failed: treeAggregate at >>>> LogisticRegression.scala:283, took 33.796379 s >>>> >>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted >>>> due to stage failure: Total size of serialized results of 1 tasks (1085.0 >>>> MB) is bigger than spark.driver.maxResultSize (1024.0 MB) >>>> >>>> >>>> -- >>>> Best Regards >>>> >>>> Jeff Zhang >>>> >>> >> >> >> -- >> Best Regards >> >> Jeff Zhang >> > -- Best Regards Jeff Zhang
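As a user-side note, a minimal sketch (assuming the goal is simply to avoid materializing the whole result on the driver at once) of RDD.toLocalIterator, an existing API that pulls results back one partition at a time instead of all together the way collect() does:

val rdd = sc.parallelize(1 to 1000000, 100)

// collect() would ship all 100 partition results to the driver at once;
// toLocalIterator runs a job per partition, so the driver only holds
// one partition's worth of data at any given time.
val it = rdd.map(_ * 2).toLocalIterator
it.take(5).foreach(println)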
Re: Support virtualenv in PySpark
I may not have expressed it clearly. This approach creates the virtualenv before the python worker starts, and the virtualenv is application-scoped: after the spark application finishes, the virtualenv is cleaned up. The virtualenvs also don't need to be at the same path on each node (in my POC, it is the yarn container working directory). That means users don't need to manually install packages on each node (sometimes you can't even install packages on the cluster for security reasons). This is the biggest benefit and the purpose of the feature: users can create a virtualenv on demand without touching each node, even when they are not administrators. The downside is the extra cost of installing the required packages before the python worker starts, but for an application that runs for several hours the extra cost can be ignored. On Tue, Mar 1, 2016 at 4:15 PM, Mohannad Ali wrote: > Hello Jeff, > > Well this would also mean that you have to manage the same virtualenv > (same path) on all nodes and install your packages to it the same way you > would install the packages to the default python path. > > In any case, at the moment you can already do what you proposed by creating > identical virtualenvs on all nodes on the same path and changing the spark > python path to point to the virtualenv. > > Best Regards, > Mohannad > On Mar 1, 2016 06:07, "Jeff Zhang" wrote: > >> I have created a JIRA for this feature; comments and feedback are welcome >> about how to improve it and whether it's valuable for users. >> >> https://issues.apache.org/jira/browse/SPARK-13587 >> >> >> Here's some background info and the status of this work. >> >> >> Currently, it's not easy for users to add third-party python packages in >> pyspark. >> >>- One way is to use --py-files (suitable for simple dependencies, but >>not suitable for complicated dependencies, especially with transitive >>dependencies) >>- Another way is to install packages manually on each node (time >>consuming, and not easy to switch to a different environment) >> >> Python now has 2 different virtualenv implementations: one is native >> virtualenv, the other is through conda. >> >> I have implemented a POC for this feature. Here's one simple command for >> how to use virtualenv in pyspark: >> >> bin/spark-submit --master yarn --deploy-mode client --conf >> "spark.pyspark.virtualenv.enabled=true" --conf >> "spark.pyspark.virtualenv.type=conda" --conf >> "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" >> --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" >> ~/work/virtualenv/spark.py >> >> There are 4 properties that need to be set >> >>- spark.pyspark.virtualenv.enabled (enable virtualenv) >>- spark.pyspark.virtualenv.type (native/conda are supported, default >>is native) >>- spark.pyspark.virtualenv.requirements (requirements file for the >>dependencies) >>- spark.pyspark.virtualenv.path (path to the virtualenv/conda >>executable) >> >> >> >> >> >> >> Best Regards >> >> Jeff Zhang >>
Support virtualenv in PySpark
I have created a JIRA for this feature; comments and feedback are welcome about how to improve it and whether it's valuable for users.

https://issues.apache.org/jira/browse/SPARK-13587

Here's some background info and the status of this work.

Currently, it's not easy for users to add third-party python packages in pyspark.

- One way is to use --py-files (suitable for simple dependencies, but not suitable for complicated dependencies, especially with transitive dependencies)
- Another way is to install packages manually on each node (time consuming, and not easy to switch to a different environment)

Python now has 2 different virtualenv implementations: one is native virtualenv, the other is through conda.

I have implemented a POC for this feature. Here's one simple command for how to use virtualenv in pyspark:

bin/spark-submit --master yarn --deploy-mode client --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.type=conda" --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" ~/work/virtualenv/spark.py

There are 4 properties that need to be set:

- spark.pyspark.virtualenv.enabled (enable virtualenv)
- spark.pyspark.virtualenv.type (native/conda are supported, default is native)
- spark.pyspark.virtualenv.requirements (requirements file for the dependencies)
- spark.pyspark.virtualenv.path (path to the virtualenv/conda executable)

Best Regards

Jeff Zhang
Re: What should be spark.local.dir in spark on yarn?
In yarn mode, spark.local.dir is superseded by yarn.nodemanager.local-dirs for shuffle data and block manager disk data. What do you mean by "But output files to upload to s3 still created in /tmp on slaves"? You should have control over where your output data is stored, if that means your job's output. On Tue, Mar 1, 2016 at 3:12 AM, Alexander Pivovarov wrote: > I have Spark on yarn > > I defined yarn.nodemanager.local-dirs to be /data01/yarn/nm,/data02/yarn/nm > > when I look at yarn executor container log I see that blockmanager files > created in /data01/yarn/nm,/data02/yarn/nm > > But output files to upload to s3 still created in /tmp on slaves > > I do not want Spark write heavy files to /tmp because /tmp is only 5GB > > spark slaves have two big additional disks /disk01 and /disk02 attached > > Probably I can set spark.local.dir to be /data01/tmp,/data02/tmp > > But spark master also writes some files to spark.local.dir > But my master box has only one additional disk /data01 > > So, what should I use for spark.local.dir the > spark.local.dir=/data01/tmp > or > spark.local.dir=/data01/tmp,/data02/tmp > > ? >
Re: Control the stdout and stderr streams in a executor JVM
You can create log4j.properties for executors, and use "--files log4j.properties" when submitting spark jobs. On Mon, Feb 29, 2016 at 1:50 PM, Niranda Perera wrote: > Hi all, > > Is there any possibility to control the stdout and stderr streams in an > executor JVM? > > I understand that there are some configurations provided from the spark > conf as follows > spark.executor.logs.rolling.maxRetainedFiles > spark.executor.logs.rolling.maxSize > spark.executor.logs.rolling.strategy > spark.executor.logs.rolling.time.interval > > But is there a possibility to have more fine grained control over these, > like we do in a log4j appender, with a property file? > > Rgds > -- > Niranda > @n1r44 <https://twitter.com/N1R44> > +94-71-554-8430 > https://pythagoreanscript.wordpress.com/ > -- Best Regards Jeff Zhang
Re: Is spark.driver.maxResultSize used correctly ?
Data skew is possible, but it is not the common case. I think we should design for the common case; for the skew case, we could add a fraction parameter to let users tune it. On Sat, Feb 27, 2016 at 4:51 PM, Reynold Xin wrote: > But sometimes you might have skew and almost all the result data are in > one or a few tasks though. > > > On Friday, February 26, 2016, Jeff Zhang wrote: > >> >> My job gets this exception very easily even when I set a large value for >> spark.driver.maxResultSize. After checking the spark code, I found that >> spark.driver.maxResultSize is also used on the executor side to decide whether >> a DirectTaskResult or an IndirectTaskResult is sent. This doesn't make sense to me. >> Using spark.driver.maxResultSize / taskNum might be more appropriate. Because >> if spark.driver.maxResultSize is 1g and we have 10 tasks, each with 200m of >> output, then the output of each task is less than >> spark.driver.maxResultSize, so a DirectTaskResult will be sent to the driver, but >> the total result size is 2g, which will cause an exception on the driver side. >> >> >> 16/02/26 10:10:49 INFO DAGScheduler: Job 4 failed: treeAggregate at >> LogisticRegression.scala:283, took 33.796379 s >> >> Exception in thread "main" org.apache.spark.SparkException: Job aborted >> due to stage failure: Total size of serialized results of 1 tasks (1085.0 >> MB) is bigger than spark.driver.maxResultSize (1024.0 MB) >> >> >> -- >> Best Regards >> >> Jeff Zhang >> > -- Best Regards Jeff Zhang
Is spark.driver.maxResultSize used correctly ?
My job gets this exception very easily even when I set a large value for spark.driver.maxResultSize. After checking the spark code, I found that spark.driver.maxResultSize is also used on the executor side to decide whether a DirectTaskResult or an IndirectTaskResult is sent. This doesn't make sense to me. Using spark.driver.maxResultSize / taskNum might be more appropriate. Because if spark.driver.maxResultSize is 1g and we have 10 tasks, each with 200m of output, then the output of each task is less than spark.driver.maxResultSize, so a DirectTaskResult will be sent to the driver, but the total result size is 2g, which will cause an exception on the driver side. 16/02/26 10:10:49 INFO DAGScheduler: Job 4 failed: treeAggregate at LogisticRegression.scala:283, took 33.796379 s Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1 tasks (1085.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) -- Best Regards Jeff Zhang
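To make the arithmetic in the proposal concrete, a minimal sketch, not actual Spark code; perTaskBudget stands for the suggested spark.driver.maxResultSize / taskNum check on the executor side, and the sizes match the example above.

val maxResultSize = 1L * 1024 * 1024 * 1024   // spark.driver.maxResultSize = 1g
val numTasks      = 10
val taskOutput    = 200L * 1024 * 1024        // each task produces roughly 200m

// Current behaviour (as described above): each task only compares its own
// result against the whole-driver limit, so all 10 results are sent directly.
val sentDirectToday = taskOutput < maxResultSize   // true
val totalOnDriver   = taskOutput * numTasks        // about 2g, which fails on the driver

// Proposed behaviour: compare against a per-task budget instead.
val perTaskBudget      = maxResultSize / numTasks  // about 100m
val sentDirectProposed = taskOutput < perTaskBudget // false, so an IndirectTaskResult is used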
Re: ORC file writing hangs in pyspark
Have you checked the live spark UI and yarn app logs ? On Tue, Feb 23, 2016 at 10:05 PM, James Barney wrote: > I'm trying to write an ORC file after running the FPGrowth algorithm on a > dataset of around just 2GB in size. The algorithm performs well and can > display results if I take(n) the freqItemSets() of the result after > converting that to a DF. > > I'm using Spark 1.5.2 on HDP 2.3.4 and Python 3.4.2 on Yarn. > > I get the results from querying a Hive table, also ORC format, running a > number of maps, joins, and filters on the data. > > When the program attempts to write the files: > result.write.orc('/data/staged/raw_result') > size_1_buckets.write.orc('/data/staged/size_1_results') > filter_size_2_buckets.write.orc('/data/staged/size_2_results') > > The first path, /data/staged/raw_result, is created with a _temporary > folder, but the data is never written. The job hangs at this point, > apparently indefinitely. > > Additionally, no logs are recorded or available for the jobs on the > history server. > > What could be the problem? > -- Best Regards Jeff Zhang
Re: Are we running SparkR tests in Jenkins?
Created https://issues.apache.org/jira/browse/SPARK-12846 On Fri, Jan 15, 2016 at 3:29 PM, Jeff Zhang wrote: > Right, I forget the documentation, will create a follow up jira. > > On Fri, Jan 15, 2016 at 3:23 PM, Shivaram Venkataraman < > shiva...@eecs.berkeley.edu> wrote: > >> Ah I see. I wasn't aware of that PR. We should do a find and replace >> in all the documentation and rest of the repository as well. >> >> Shivaram >> >> On Fri, Jan 15, 2016 at 3:20 PM, Reynold Xin wrote: >> > +Shivaram >> > >> > Ah damn - we should fix it. >> > >> > This was broken by https://github.com/apache/spark/pull/10658 - which >> > removed a functionality that has been deprecated since Spark 1.0. >> > >> > >> > >> > >> > >> > On Fri, Jan 15, 2016 at 3:19 PM, Herman van Hövell tot Westerflier >> > wrote: >> >> >> >> Hi all, >> >> >> >> I just noticed the following log entry in Jenkins: >> >> >> >>> >> >> >>> Running SparkR tests >> >>> >> >> >>> Running R applications through 'sparkR' is not supported as of Spark >> 2.0. >> >>> Use ./bin/spark-submit >> >> >> >> >> >> Are we still running R tests? Or just saying that this will be >> deprecated? >> >> >> >> Kind regards, >> >> >> >> Herman van Hövell tot Westerflier >> >> >> > >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> For additional commands, e-mail: dev-h...@spark.apache.org >> >> > > > -- > Best Regards > > Jeff Zhang > -- Best Regards Jeff Zhang
Re: Are we running SparkR tests in Jenkins?
Right, I forget the documentation, will create a follow up jira. On Fri, Jan 15, 2016 at 3:23 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > Ah I see. I wasn't aware of that PR. We should do a find and replace > in all the documentation and rest of the repository as well. > > Shivaram > > On Fri, Jan 15, 2016 at 3:20 PM, Reynold Xin wrote: > > +Shivaram > > > > Ah damn - we should fix it. > > > > This was broken by https://github.com/apache/spark/pull/10658 - which > > removed a functionality that has been deprecated since Spark 1.0. > > > > > > > > > > > > On Fri, Jan 15, 2016 at 3:19 PM, Herman van Hövell tot Westerflier > > wrote: > >> > >> Hi all, > >> > >> I just noticed the following log entry in Jenkins: > >> > >>> > > >>> Running SparkR tests > >>> > > >>> Running R applications through 'sparkR' is not supported as of Spark > 2.0. > >>> Use ./bin/spark-submit > >> > >> > >> Are we still running R tests? Or just saying that this will be > deprecated? > >> > >> Kind regards, > >> > >> Herman van Hövell tot Westerflier > >> > > > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > > -- Best Regards Jeff Zhang
Re: [discuss] dropping Python 2.6 support
>>>>>>>>>> As I pointed out in my earlier email, RHEL will support Python >>>>>>>>>>> 2.6 until 2020. So I'm assuming these large companies will have the >>>>>>>>>>> option >>>>>>>>>>> of riding out Python 2.6 until then. >>>>>>>>>>> >>>>>>>>>>> Are we seriously saying that Spark should likewise support >>>>>>>>>>> Python 2.6 for the next several years? Even though the core Python >>>>>>>>>>> devs >>>>>>>>>>> stopped supporting it in 2013? >>>>>>>>>>> >>>>>>>>>>> If that's not what we're suggesting, then when, roughly, can we >>>>>>>>>>> drop support? What are the criteria? >>>>>>>>>>> >>>>>>>>>>> I understand the practical concern here. If companies are stuck >>>>>>>>>>> using 2.6, it doesn't matter to them that it is deprecated. But >>>>>>>>>>> balancing >>>>>>>>>>> that concern against the maintenance burden on this project, I >>>>>>>>>>> would say >>>>>>>>>>> that "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable >>>>>>>>>>> position to take. There are many tiny annoyances one has to put up >>>>>>>>>>> with to >>>>>>>>>>> support 2.6. >>>>>>>>>>> >>>>>>>>>>> I suppose if our main PySpark contributors are fine putting up >>>>>>>>>>> with those annoyances, then maybe we don't need to drop support >>>>>>>>>>> just yet... >>>>>>>>>>> >>>>>>>>>>> Nick >>>>>>>>>>> 2016년 1월 5일 (화) 오후 2:27, Julio Antonio Soto de Vicente < >>>>>>>>>>> ju...@esbet.es>님이 작성: >>>>>>>>>>> >>>>>>>>>>>> Unfortunately, Koert is right. >>>>>>>>>>>> >>>>>>>>>>>> I've been in a couple of projects using Spark (banking >>>>>>>>>>>> industry) where CentOS + Python 2.6 is the toolbox available. >>>>>>>>>>>> >>>>>>>>>>>> That said, I believe it should not be a concern for Spark. >>>>>>>>>>>> Python 2.6 is old and busted, which is totally opposite to the >>>>>>>>>>>> Spark >>>>>>>>>>>> philosophy IMO. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> El 5 ene 2016, a las 20:07, Koert Kuipers >>>>>>>>>>>> escribió: >>>>>>>>>>>> >>>>>>>>>>>> rhel/centos 6 ships with python 2.6, doesnt it? >>>>>>>>>>>> >>>>>>>>>>>> if so, i still know plenty of large companies where python 2.6 >>>>>>>>>>>> is the only option. asking them for python 2.7 is not going to work >>>>>>>>>>>> >>>>>>>>>>>> so i think its a bad idea >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland < >>>>>>>>>>>> juliet.hougl...@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> I don't see a reason Spark 2.0 would need to support Python >>>>>>>>>>>>> 2.6. At this point, Python 3 should be the default that is >>>>>>>>>>>>> encouraged. >>>>>>>>>>>>> Most organizations acknowledge the 2.7 is common, but lagging >>>>>>>>>>>>> behind the version they should theoretically use. Dropping python >>>>>>>>>>>>> 2.6 >>>>>>>>>>>>> support sounds very reasonable to me. >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas < >>>>>>>>>>>>> nicholas.cham...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> +1 >>>>>>>>>>>>>> >>>>>>>>>>>>>> Red Hat supports Python 2.6 on REHL 5 until 2020 >>>>>>>>>>>>>> <https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/>, >>>>>>>>>>>>>> but otherwise yes, Python 2.6 is ancient history and the core >>>>>>>>>>>>>> Python >>>>>>>>>>>>>> developers stopped supporting it in 2013. REHL 5 is not a good >>>>>>>>>>>>>> enough >>>>>>>>>>>>>> reason to continue support for Python 2.6 IMO. >>>>>>>>>>>>>> >>>>>>>>>>>>>> We should aim to support Python 2.7 and Python 3.3+ (which I >>>>>>>>>>>>>> believe we currently do). 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Nick >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang < >>>>>>>>>>>>>> allenzhang...@126.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> plus 1, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> we are currently using python 2.7.2 in production >>>>>>>>>>>>>>> environment. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 在 2016-01-05 18:11:45,"Meethu Mathew" < >>>>>>>>>>>>>>> meethu.mat...@flytxt.com> 写道: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> +1 >>>>>>>>>>>>>>> We use Python 2.7 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Meethu Mathew >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin < >>>>>>>>>>>>>>> r...@databricks.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Does anybody here care about us dropping support for Python >>>>>>>>>>>>>>>> 2.6 in Spark 2.0? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Python 2.6 is ancient, and is pretty slow in many aspects >>>>>>>>>>>>>>>> (e.g. json parsing) when compared with Python 2.7. Some >>>>>>>>>>>>>>>> libraries that >>>>>>>>>>>>>>>> Spark depend on stopped supporting 2.6. We can still convince >>>>>>>>>>>>>>>> the library >>>>>>>>>>>>>>>> maintainers to support 2.6, but it will be extra work. I'm >>>>>>>>>>>>>>>> curious if >>>>>>>>>>>>>>>> anybody still uses Python 2.6 to run Spark. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>> >>>> >>> >>> >> > -- Best Regards Jeff Zhang
Re: How to execute non-hadoop command ?
Sorry, wrong list On Tue, Jan 5, 2016 at 12:36 PM, Jeff Zhang wrote: > I want to create service check for spark, but spark don't use hadoop > script as launch script. I found other component use ExecuteHadoop to > launch hadoop job to verify the service, I am wondering is there is there > any api for non-hadoop command ? BTW I check the source code > of execute_hadoop.py but don't find how it associates with hadoop > > -- > Best Regards > > Jeff Zhang > -- Best Regards Jeff Zhang
How to execute non-hadoop command ?
I want to create a service check for Spark, but Spark doesn't use the hadoop script as its launch script. I found that other components use ExecuteHadoop to launch a Hadoop job to verify the service, and I am wondering whether there is any API for non-Hadoop commands. BTW, I checked the source code of execute_hadoop.py but couldn't find how it is associated with hadoop -- Best Regards Jeff Zhang
Re: 答复: How can I get the column data based on specific column name and then stored these data in array or list ?
You can use a UDF to convert one column to an array type. Here's one sample:

val conf = new SparkConf().setMaster("local[4]").setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
import sqlContext._

sqlContext.udf.register("f", (a: String) => Array(a, a))
val df1 = Seq(
  (1, "jeff", 12),
  (2, "andy", 34),
  (3, "pony", 23),
  (4, "jeff", 14)
).toDF("id", "name", "age")
val df2 = df1.withColumn("name", expr("f(name)"))
df2.printSchema()
df2.show()

On Fri, Dec 25, 2015 at 3:44 PM, zml张明磊 wrote: > Thanks, Jeff. It's not choose some columns of a Row. It's just choose all > data in a column and convert it to an Array. Do you understand my mean ? > > > > In Chinese > > (Translated: I want to select all the data in this column based on the column name, and then put it into an array.) > > > > > > *From:* Jeff Zhang [mailto:zjf...@gmail.com] > *Sent:* December 25, 2015 15:39 > *To:* zml张明磊 > *Cc:* dev@spark.apache.org > *Subject:* Re: How can I get the column data based on specific column name and > then stored these data in array or list ? > > > > Not sure what you mean. Do you want to choose some columns of a Row and > convert it to an Arrray ? > > > > On Fri, Dec 25, 2015 at 3:35 PM, zml张明磊 wrote: > > > > Hi, > > > >I am a new to Scala and Spark and trying to find relative API in > DataFrame > to solve my problem as title described. However, I just only find this API > *DataFrame.col(colName > : String) : Column * which returns an object of Column. Not the content. > If only DataFrame support such API which like *Column.toArray : Type* is > enough for me. But now, it doesn't. How can I do can achieve this function > ? > > > > Thanks, > > Minglei. > > > > > > -- > > Best Regards > > Jeff Zhang > -- Best Regards Jeff Zhang
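For what the original question asks (all the values of one column gathered into a local array), a slightly different sketch may be closer; it assumes the Spark 1.x DataFrame API, and the sample data, column name and master URL are only illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("column-to-array"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = Seq((1, "jeff", 12), (2, "andy", 34), (3, "pony", 23)).toDF("id", "name", "age")
// Keep only the "name" column and materialize it on the driver as a local array.
val names: Array[String] = df.select("name").rdd.map(_.getString(0)).collect()
names.foreach(println)

Since collect() brings everything back to the driver, this is only appropriate when the column comfortably fits in driver memory.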
Re: How can I get the column data based on specific column name and then stored these data in array or list ?
Not sure what you mean. Do you want to choose some columns of a Row and convert it to an Array ? On Fri, Dec 25, 2015 at 3:35 PM, zml张明磊 wrote: > > > Hi, > > > >I am a new to Scala and Spark and trying to find relative API in > DataFrame > to solve my problem as title described. However, I just only find this API > *DataFrame.col(colName > : String) : Column * which returns an object of Column. Not the content. > If only DataFrame support such API which like *Column.toArray : Type* is > enough for me. But now, it doesn't. How can I do can achieve this function > ? > > > > Thanks, > > Minglei. > -- Best Regards Jeff Zhang
Re: [VOTE] Release Apache Spark 1.6.0 (RC4)
- SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online >>hypothesis testing - A/B testing in the Spark Streaming framework >>- SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New >>feature transformers - ChiSqSelector, QuantileDiscretizer, SQL >>transformer >>- SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting >>K-Means clustering - Fast top-down clustering variant of K-Means >> >> API improvements >> >>- ML Pipelines >> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> >> Pipeline >> persistence - Save/load for ML Pipelines, with partial coverage of >> spark.mlalgorithms >> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA >> in ML Pipelines - API for Latent Dirichlet Allocation in ML >> Pipelines >>- R API >> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like >> statistics for GLMs - (Partial) R-like stats for ordinary least >> squares via summary(model) >> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> >> Feature >> interactions in R formula - Interaction operator ":" in R formula >>- Python API - Many improvements to Python API to approach feature >>parity >> >> Misc improvements >> >>- SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>, >>SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance >>weights for GLMs - Logistic and Linear Regression can take instance >>weights >>- SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>, >>SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate >>and bivariate statistics in DataFrames - Variance, stddev, >>correlations, etc. >>- SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM >>data source - LIBSVM as a SQL data sourceDocumentation improvements >>- SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since >>versions - Documentation includes initial version when classes and >>methods were added >>- SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable >>example code - Automated testing for code in user guide examples >> >> Deprecations >> >>- In spark.mllib.clustering.KMeans, the "runs" parameter has been >>deprecated. >>- In spark.ml.classification.LogisticRegressionModel and >>spark.ml.regression.LinearRegressionModel, the "weights" field has been >>deprecated, in favor of the new name "coefficients." This helps >>disambiguate from instance (row) weights given to algorithms. >> >> Changes of behavior >> >>- spark.mllib.tree.GradientBoostedTrees validationTol has changed >>semantics in 1.6. Previously, it was a threshold for absolute change in >>error. Now, it resembles the behavior of GradientDescent convergenceTol: >>For large errors, it uses relative error (relative to the previous error); >>for small errors (< 0.01), it uses absolute error. >>- spark.ml.feature.RegexTokenizer: Previously, it did not convert >>strings to lowercase before tokenizing. Now, it converts to lowercase by >>default, with an option not to. This matches the behavior of the simpler >>Tokenizer transformer. >>- Spark SQL's partition discovery has been changed to only discover >>partition directories that are children of the given path. (i.e. if >>path="/my/data/x=1" then x=1 will no longer be considered a partition >>but only children of x=1.) This behavior can be overridden by >>manually specifying the basePath that partitioning discovery should >>start with (SPARK-11678 >><https://issues.apache.org/jira/browse/SPARK-11678>). 
>>- When casting a value of an integral type to timestamp (e.g. casting >>a long value to timestamp), the value is treated as being in seconds >>instead of milliseconds (SPARK-11724 >><https://issues.apache.org/jira/browse/SPARK-11724>). >>- With the improved query planner for queries having distinct >>aggregations (SPARK-9241 >><https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a >>query having a single distinct aggregation has been changed to a more >>robust version. To switch back to the plan generated by Spark 1.5's >>planner, please set spark.sql.specializeSingleDistinctAggPlanning to >>true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077> >>). >> >> > -- Best Regards Jeff Zhang
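One of the behavior changes listed above, the partition-discovery change from SPARK-11678, is easier to see with a small sketch. This assumes the Spark 1.6 DataFrameReader; the paths are purely illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("basepath-example"))
val sqlContext = new SQLContext(sc)

// Reading a partition directory directly no longer discovers "x" as a partition column.
val noPartitionCol = sqlContext.read.parquet("/my/data/x=1")
// Passing basePath restores the old behaviour: "x" is discovered as a partition column again.
val withPartitionCol = sqlContext.read.option("basePath", "/my/data").parquet("/my/data/x=1")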
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
;- SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New >>feature transformers - ChiSqSelector, QuantileDiscretizer, SQL >>transformer >>- SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting >>K-Means clustering - Fast top-down clustering variant of K-Means >> >> API improvements >> >>- ML Pipelines >> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> >> Pipeline >> persistence - Save/load for ML Pipelines, with partial coverage of >> spark.mlalgorithms >> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA >> in ML Pipelines - API for Latent Dirichlet Allocation in ML >> Pipelines >>- R API >> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like >> statistics for GLMs - (Partial) R-like stats for ordinary least >> squares via summary(model) >> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> >> Feature >> interactions in R formula - Interaction operator ":" in R formula >>- Python API - Many improvements to Python API to approach feature >>parity >> >> Misc improvements >> >>- SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>, >>SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance >>weights for GLMs - Logistic and Linear Regression can take instance >>weights >>- SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>, >>SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate >>and bivariate statistics in DataFrames - Variance, stddev, >>correlations, etc. >>- SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM >>data source - LIBSVM as a SQL data sourceDocumentation improvements >>- SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since >>versions - Documentation includes initial version when classes and >>methods were added >>- SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable >>example code - Automated testing for code in user guide examples >> >> Deprecations >> >>- In spark.mllib.clustering.KMeans, the "runs" parameter has been >>deprecated. >>- In spark.ml.classification.LogisticRegressionModel and >>spark.ml.regression.LinearRegressionModel, the "weights" field has been >>deprecated, in favor of the new name "coefficients." This helps >>disambiguate from instance (row) weights given to algorithms. >> >> Changes of behavior >> >>- spark.mllib.tree.GradientBoostedTrees validationTol has changed >>semantics in 1.6. Previously, it was a threshold for absolute change in >>error. Now, it resembles the behavior of GradientDescent convergenceTol: >>For large errors, it uses relative error (relative to the previous error); >>for small errors (< 0.01), it uses absolute error. >>- spark.ml.feature.RegexTokenizer: Previously, it did not convert >>strings to lowercase before tokenizing. Now, it converts to lowercase by >>default, with an option not to. This matches the behavior of the simpler >>Tokenizer transformer. >>- Spark SQL's partition discovery has been changed to only discover >>partition directories that are children of the given path. (i.e. if >>path="/my/data/x=1" then x=1 will no longer be considered a partition >>but only children of x=1.) This behavior can be overridden by >>manually specifying the basePath that partitioning discovery should >>start with (SPARK-11678 >><https://issues.apache.org/jira/browse/SPARK-11678>). >>- When casting a value of an integral type to timestamp (e.g. 
casting >>a long value to timestamp), the value is treated as being in seconds >>instead of milliseconds (SPARK-11724 >><https://issues.apache.org/jira/browse/SPARK-11724>). >>- With the improved query planner for queries having distinct >>aggregations (SPARK-9241 >><https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a >>query having a single distinct aggregation has been changed to a more >>robust version. To switch back to the plan generated by Spark 1.5's >>planner, please set spark.sql.specializeSingleDistinctAggPlanning to >>true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077> >>). >> >> > > > -- > Luciano Resende > http://people.apache.org/~lresende > http://twitter.com/lresende1975 > http://lresende.blogspot.com/ > -- Best Regards Jeff Zhang
Re: [SparkR] Any reason why saveDF's mode is append by default ?
Thanks Shivaram, created https://issues.apache.org/jira/browse/SPARK-12318 I will work on it. On Mon, Dec 14, 2015 at 4:13 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > I think its just a bug -- I think we originally followed the Python > API (in the original PR [1]) but the Python API seems to have been > changed to match Scala / Java in > https://issues.apache.org/jira/browse/SPARK-6366 > > Feel free to open a JIRA / PR for this. > > Thanks > Shivaram > > [1] https://github.com/amplab-extras/SparkR-pkg/pull/199/files > > On Sun, Dec 13, 2015 at 11:58 PM, Jeff Zhang wrote: > > It is inconsistent with scala api which is error by default. Any reason > for > > that ? Thanks > > > > > > > > -- > > Best Regards > > > > Jeff Zhang > -- Best Regards Jeff Zhang
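For comparison with the SparkR default being discussed here, a minimal Scala sketch of the corresponding DataFrameWriter behaviour (Spark 1.4+); the output path and sample data are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("save-mode"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
// Scala/Java default is SaveMode.ErrorIfExists: this fails if /tmp/out already exists.
df.write.format("parquet").save("/tmp/out")
// The behaviour SparkR's saveDF defaulted to at the time of this thread:
df.write.mode(SaveMode.Append).format("parquet").save("/tmp/out")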
[SparkR] Any reason why saveDF's mode is append by default ?
It is inconsistent with the Scala API, which is "error" by default. Any reason for that ? Thanks -- Best Regards Jeff Zhang
Re: Spark doesn't unset HADOOP_CONF_DIR when testing ?
Thanks Josh, created https://issues.apache.org/jira/browse/SPARK-12166 On Mon, Dec 7, 2015 at 4:32 AM, Josh Rosen wrote: > I agree that we should unset this in our tests. Want to file a JIRA and > submit a PR to do this? > > On Thu, Dec 3, 2015 at 6:40 PM Jeff Zhang wrote: > >> I try to do test on HiveSparkSubmitSuite on local box, but fails. The >> cause is that spark is still using my local single node cluster hadoop when >> doing the unit test. I don't think it make sense to do that. These >> environment variable should be unset before the testing. And I suspect >> dev/run-tests also >> didn't do that either. >> >> Here's the error message: >> >> Cause: java.lang.RuntimeException: java.lang.RuntimeException: The root >> scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: >> rwxr-xr-x >> [info] at >> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) >> [info] at >> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171) >> [info] at >> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162) >> [info] at >> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) >> >> >> >> -- >> Best Regards >> >> Jeff Zhang >> > -- Best Regards Jeff Zhang
Spark doesn't unset HADOOP_CONF_DIR when testing ?
I tried to run HiveSparkSubmitSuite on my local box, but it fails. The cause is that Spark is still using my local single-node Hadoop cluster when running the unit test. I don't think it makes sense to do that; this environment variable should be unset before testing, and I suspect dev/run-tests doesn't do that either.

Here's the error message:

Cause: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxr-xr-x
[info] at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
[info] at org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
[info] at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
[info] at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)

-- Best Regards Jeff Zhang
Re: Problem in running MLlib SVM
I think this should represent the label of LabledPoint (0 means negative 1 means positive) http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point The document you mention is for the mathematical formula, not the implementation. On Sun, Nov 29, 2015 at 9:13 AM, Tarek Elgamal wrote: > According to the documentation > <http://spark.apache.org/docs/latest/mllib-linear-methods.html>, by > default, if wTx≥0 then the outcome is positive, and negative otherwise. I > suppose that wTx is the "score" in my case. If score is more than 0 and the > label is positive, then I return 1 which is correct classification and I > return zero otherwise. Do you have any idea how to classify a point as > positive or negative using this score or another function ? > > On Sat, Nov 28, 2015 at 5:14 AM, Jeff Zhang wrote: > >> if((score >=0 && label == 1) || (score <0 && label == 0)) >> { >> return 1; //correct classiciation >> } >> else >> return 0; >> >> >> >> I suspect score is always between 0 and 1 >> >> >> >> On Sat, Nov 28, 2015 at 10:39 AM, Tarek Elgamal >> wrote: >> >>> Hi, >>> >>> I am trying to run the straightforward example of SVm but I am getting >>> low accuracy (around 50%) when I predict using the same data I used for >>> training. I am probably doing the prediction in a wrong way. My code is >>> below. I would appreciate any help. >>> >>> >>> import java.util.List; >>> >>> import org.apache.spark.SparkConf; >>> import org.apache.spark.SparkContext; >>> import org.apache.spark.api.java.JavaRDD; >>> import org.apache.spark.api.java.function.Function; >>> import org.apache.spark.api.java.function.Function2; >>> import org.apache.spark.mllib.classification.SVMModel; >>> import org.apache.spark.mllib.classification.SVMWithSGD; >>> import org.apache.spark.mllib.regression.LabeledPoint; >>> import org.apache.spark.mllib.util.MLUtils; >>> >>> import scala.Tuple2; >>> import edu.illinois.biglbjava.readers.LabeledPointReader; >>> >>> public class SimpleDistSVM { >>> public static void main(String[] args) { >>> SparkConf conf = new SparkConf().setAppName("SVM Classifier >>> Example"); >>> SparkContext sc = new SparkContext(conf); >>> String inputPath=args[0]; >>> >>> // Read training data >>> JavaRDD data = MLUtils.loadLibSVMFile(sc, >>> inputPath).toJavaRDD(); >>> >>> // Run training algorithm to build the model. >>> int numIterations = 3; >>> final SVMModel model = SVMWithSGD.train(data.rdd(), numIterations); >>> >>> // Clear the default threshold. 
>>> model.clearThreshold(); >>> >>> >>> // Predict points in test set and map to an RDD of 0/1 values where >>> 0 is misclassication and 1 is correct classification >>> JavaRDD classification = data.map(new >>> Function() { >>> public Integer call(LabeledPoint p) { >>>int label = (int) p.label(); >>>Double score = model.predict(p.features()); >>>if((score >=0 && label == 1) || (score <0 && label == 0)) >>>{ >>>return 1; //correct classiciation >>>} >>>else >>> return 0; >>> >>> } >>>} >>> ); >>> // sum up all values in the rdd to get the number of correctly >>> classified examples >>> int sum=classification.reduce(new Function2>> Integer>() >>> { >>> public Integer call(Integer arg0, Integer arg1) >>> throws Exception { >>> return arg0+arg1; >>> }}); >>> >>> //compute accuracy as the percentage of the correctly classified >>> examples >>> double accuracy=((double)sum)/((double)classification.count()); >>> System.out.println("Accuracy = " + accuracy); >>> >>> } >>> } >>> ); >>> } >>> } >>> >> >> >> >> -- >> Best Regards >> >> Jeff Zhang >> > > -- Best Regards Jeff Zhang
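A minimal Scala sketch of how the threshold affects what predict() returns, assuming the Spark 1.x MLlib API; the data path and iteration count are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("svm-threshold"))
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val model = SVMWithSGD.train(data, 100)

// With the default threshold kept, predict() returns 0.0/1.0 labels,
// so accuracy is just the fraction of predictions equal to p.label.
val accuracy = data.map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0).mean()

// After clearThreshold(), predict() returns the raw margin (any real number),
// which is what tripped up the accuracy check in the code quoted above.
model.clearThreshold()
val margins = data.map(p => (model.predict(p.features), p.label))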
Re: Problem in running MLlib SVM
if((score >=0 && label == 1) || (score <0 && label == 0)) { return 1; //correct classiciation } else return 0; I suspect score is always between 0 and 1 On Sat, Nov 28, 2015 at 10:39 AM, Tarek Elgamal wrote: > Hi, > > I am trying to run the straightforward example of SVm but I am getting low > accuracy (around 50%) when I predict using the same data I used for > training. I am probably doing the prediction in a wrong way. My code is > below. I would appreciate any help. > > > import java.util.List; > > import org.apache.spark.SparkConf; > import org.apache.spark.SparkContext; > import org.apache.spark.api.java.JavaRDD; > import org.apache.spark.api.java.function.Function; > import org.apache.spark.api.java.function.Function2; > import org.apache.spark.mllib.classification.SVMModel; > import org.apache.spark.mllib.classification.SVMWithSGD; > import org.apache.spark.mllib.regression.LabeledPoint; > import org.apache.spark.mllib.util.MLUtils; > > import scala.Tuple2; > import edu.illinois.biglbjava.readers.LabeledPointReader; > > public class SimpleDistSVM { > public static void main(String[] args) { > SparkConf conf = new SparkConf().setAppName("SVM Classifier Example"); > SparkContext sc = new SparkContext(conf); > String inputPath=args[0]; > > // Read training data > JavaRDD data = MLUtils.loadLibSVMFile(sc, > inputPath).toJavaRDD(); > > // Run training algorithm to build the model. > int numIterations = 3; > final SVMModel model = SVMWithSGD.train(data.rdd(), numIterations); > > // Clear the default threshold. > model.clearThreshold(); > > > // Predict points in test set and map to an RDD of 0/1 values where 0 > is misclassication and 1 is correct classification > JavaRDD classification = data.map(new Function Integer>() { > public Integer call(LabeledPoint p) { >int label = (int) p.label(); >Double score = model.predict(p.features()); >if((score >=0 && label == 1) || (score <0 && label == 0)) >{ >return 1; //correct classiciation >} >else > return 0; > > } >} > ); > // sum up all values in the rdd to get the number of correctly > classified examples > int sum=classification.reduce(new Function2 Integer>() > { > public Integer call(Integer arg0, Integer arg1) > throws Exception { > return arg0+arg1; > }}); > > //compute accuracy as the percentage of the correctly classified > examples > double accuracy=((double)sum)/((double)classification.count()); > System.out.println("Accuracy = " + accuracy); > > } > } > ); > } > } > -- Best Regards Jeff Zhang
Re: FW: SequenceFile and object reuse
Would this be an issue with the raw data ? I use the following simple code and don't hit the issue you mentioned. Otherwise, it would be better to share your code.

val rdd = sc.sequenceFile("/Users/hadoop/Temp/Seq", classOf[IntWritable], classOf[Text])
rdd.map { case (k, v) => (k.get(), v.toString) }.collect() foreach println

On Thu, Nov 19, 2015 at 12:04 PM, jeff saremi wrote: > I sent this to the user forum. I got no responses. Could someone here > please help? thanks > jeff > > -- > From: jeffsar...@hotmail.com > To: u...@spark.apache.org > Subject: SequenceFile and object reuse > Date: Fri, 13 Nov 2015 13:29:58 -0500 > > > So we tried reading a sequencefile in Spark and realized that all our > records have ended up becoming the same. > THen one of us found this: > > Note: Because Hadoop's RecordReader class re-uses the same Writable object > for each record, directly caching the returned RDD or directly passing it > to an aggregation or shuffle operation will create many references to the > same object. If you plan to directly cache, sort, or aggregate Hadoop > writable objects, you should first copy them using a map function. > > Is there anyone that can shed some light on this bizzare behavior and the > decisions behind it? > And I also would like to know if anyone's able to read a binary file and > not to incur the additional map() as suggested by the above? What format > did you use? > > thanks > Jeff > -- Best Regards Jeff Zhang
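To make the copy-before-caching advice from the quoted note concrete, here is a minimal sketch assuming the standard Spark and Hadoop APIs; the path is illustrative:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("seqfile-copy"))
val raw = sc.sequenceFile("/path/to/seq", classOf[IntWritable], classOf[Text])
// Hadoop's RecordReader reuses the same Writable instances, so caching `raw`
// directly can leave many references to a single, repeatedly mutated object.
// Copying into plain Scala values first avoids that:
val copied = raw.map { case (k, v) => (k.get(), v.toString) }
copied.cache()
copied.take(5).foreach(println)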
Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?
Created https://issues.apache.org/jira/browse/SPARK-11798 On Wed, Nov 18, 2015 at 9:42 AM, Josh Rosen wrote: > Can you file a JIRA issue to help me triage this further? Thanks! > > On Tue, Nov 17, 2015 at 4:08 PM Jeff Zhang wrote: > >> Sure, hive profile is enabled. >> >> On Wed, Nov 18, 2015 at 6:12 AM, Josh Rosen >> wrote: >> >>> Is the Hive profile enabled? I think it may need to be turned on in >>> order for those JARs to be deployed. >>> >>> On Tue, Nov 17, 2015 at 2:27 AM Jeff Zhang wrote: >>> >>>> BTW, After I revert SPARK-7841, I can see all the jars under >>>> lib_managed/jars >>>> >>>> On Tue, Nov 17, 2015 at 2:46 PM, Jeff Zhang wrote: >>>> >>>>> Hi Josh, >>>>> >>>>> I notice the comments in https://github.com/apache/spark/pull/9575 said >>>>> that Datanucleus related jars will still be copied to >>>>> lib_managed/jars. But I don't see any jars under lib_managed/jars. >>>>> The weird thing is that I see the jars on another machine, but could not >>>>> see jars on my laptop even after I delete the whole spark project and >>>>> start >>>>> from scratch. Does it related with environments ? I try to add the >>>>> following code in SparkBuild.scala to track the issue, it shows that the >>>>> jars is empty. Any thoughts on that ? >>>>> >>>>> >>>>> deployDatanucleusJars := { >>>>> val jars: Seq[File] = (fullClasspath in >>>>> assembly).value.map(_.data) >>>>> .filter(_.getPath.contains("org.datanucleus")) >>>>> // this is what I added >>>>> println("*") >>>>> println("fullClasspath:"+fullClasspath) >>>>> println("assembly:"+assembly) >>>>> println("jars:"+jars.map(_.getAbsolutePath()).mkString(",")) >>>>> // >>>>> >>>>> >>>>> On Mon, Nov 16, 2015 at 4:51 PM, Jeff Zhang wrote: >>>>> >>>>>> This is the exception I got >>>>>> >>>>>> 15/11/16 16:50:48 WARN metastore.HiveMetaStore: Retrying creating >>>>>> default database after error: Class >>>>>> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found. >>>>>> javax.jdo.JDOFatalUserException: Class >>>>>> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found. 
>>>>>> at >>>>>> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175) >>>>>> at >>>>>> javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) >>>>>> at >>>>>> javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) >>>>>> at >>>>>> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365) >>>>>> at >>>>>> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394) >>>>>> at >>>>>> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291) >>>>>> at >>>>>> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258) >>>>>> at >>>>>> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) >>>>>> at >>>>>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) >>>>>> at >>>>>> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57) >>>>>> at >>>>>> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66) >>>>>> at >>>>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593) >>>>>> at >>>>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571) >>>>>> at >>>>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620) >>>>>> at >>>>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461) >>>>>> at >>>>>> org.apache.hadoop.
Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?
Sure, hive profile is enabled. On Wed, Nov 18, 2015 at 6:12 AM, Josh Rosen wrote: > Is the Hive profile enabled? I think it may need to be turned on in order > for those JARs to be deployed. > > On Tue, Nov 17, 2015 at 2:27 AM Jeff Zhang wrote: > >> BTW, After I revert SPARK-7841, I can see all the jars under >> lib_managed/jars >> >> On Tue, Nov 17, 2015 at 2:46 PM, Jeff Zhang wrote: >> >>> Hi Josh, >>> >>> I notice the comments in https://github.com/apache/spark/pull/9575 said >>> that Datanucleus related jars will still be copied to lib_managed/jars. >>> But I don't see any jars under lib_managed/jars. The weird thing is that I >>> see the jars on another machine, but could not see jars on my laptop even >>> after I delete the whole spark project and start from scratch. Does it >>> related with environments ? I try to add the following code in >>> SparkBuild.scala to track the issue, it shows that the jars is empty. Any >>> thoughts on that ? >>> >>> >>> deployDatanucleusJars := { >>> val jars: Seq[File] = (fullClasspath in assembly).value.map(_.data) >>> .filter(_.getPath.contains("org.datanucleus")) >>> // this is what I added >>> println("*") >>> println("fullClasspath:"+fullClasspath) >>> println("assembly:"+assembly) >>> println("jars:"+jars.map(_.getAbsolutePath()).mkString(",")) >>> // >>> >>> >>> On Mon, Nov 16, 2015 at 4:51 PM, Jeff Zhang wrote: >>> >>>> This is the exception I got >>>> >>>> 15/11/16 16:50:48 WARN metastore.HiveMetaStore: Retrying creating >>>> default database after error: Class >>>> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found. >>>> javax.jdo.JDOFatalUserException: Class >>>> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found. >>>> at >>>> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175) >>>> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) >>>> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) >>>> at >>>> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365) >>>> at >>>> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394) >>>> at >>>> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291) >>>> at >>>> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258) >>>> at >>>> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) >>>> at >>>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) >>>> at >>>> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57) >>>> at >>>> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66) >>>> at >>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593) >>>> at >>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571) >>>> at >>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620) >>>> at >>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461) >>>> at >>>> org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:66) >>>> at >>>> org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72) >>>> at >>>> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762) >>>> at >>>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:199) >>>> at >>>> 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74) >>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) >>>> at >>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) >>>> at >>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) >>>> at java.lan
Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?
BTW, After I revert SPARK-7841, I can see all the jars under lib_managed/jars On Tue, Nov 17, 2015 at 2:46 PM, Jeff Zhang wrote: > Hi Josh, > > I notice the comments in https://github.com/apache/spark/pull/9575 said > that Datanucleus related jars will still be copied to lib_managed/jars. > But I don't see any jars under lib_managed/jars. The weird thing is that I > see the jars on another machine, but could not see jars on my laptop even > after I delete the whole spark project and start from scratch. Does it > related with environments ? I try to add the following code in > SparkBuild.scala to track the issue, it shows that the jars is empty. Any > thoughts on that ? > > > deployDatanucleusJars := { > val jars: Seq[File] = (fullClasspath in assembly).value.map(_.data) > .filter(_.getPath.contains("org.datanucleus")) > // this is what I added > println("*") > println("fullClasspath:"+fullClasspath) > println("assembly:"+assembly) > println("jars:"+jars.map(_.getAbsolutePath()).mkString(",")) > // > > > On Mon, Nov 16, 2015 at 4:51 PM, Jeff Zhang wrote: > >> This is the exception I got >> >> 15/11/16 16:50:48 WARN metastore.HiveMetaStore: Retrying creating default >> database after error: Class >> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found. >> javax.jdo.JDOFatalUserException: Class >> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found. >> at >> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175) >> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) >> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) >> at >> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365) >> at >> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394) >> at >> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291) >> at >> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258) >> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) >> at >> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) >> at >> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57) >> at >> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66) >> at >> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593) >> at >> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571) >> at >> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620) >> at >> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461) >> at >> org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:66) >> at >> org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72) >> at >> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762) >> at >> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:199) >> at >> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74) >> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) >> at >> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) >> at >> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) >> at 
java.lang.reflect.Constructor.newInstance(Constructor.java:526) >> at >> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521) >> at >> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86) >> >> On Mon, Nov 16, 2015 at 4:47 PM, Jeff Zhang wrote: >> >>> It's about the datanucleus related jars which is needed by spark sql. >>> Without these jars, I could not call data frame related api ( I make >>> HiveContext enabled) >>> >>> >>> >>> On Mon, Nov 16, 2015 at 4:10 PM, Josh Rosen >>> wrote: >>> >>>> As of https://github.com/apache/spark/pull/9575, Spark's build will no >>>> longer place every dependency JAR into lib_managed. Can you say more about >>>> how this affected spark-shell for
Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?
Hi Josh, I notice the comments in https://github.com/apache/spark/pull/9575 said that Datanucleus related jars will still be copied to lib_managed/jars. But I don't see any jars under lib_managed/jars. The weird thing is that I see the jars on another machine, but could not see jars on my laptop even after I delete the whole spark project and start from scratch. Does it related with environments ? I try to add the following code in SparkBuild.scala to track the issue, it shows that the jars is empty. Any thoughts on that ? deployDatanucleusJars := { val jars: Seq[File] = (fullClasspath in assembly).value.map(_.data) .filter(_.getPath.contains("org.datanucleus")) // this is what I added println("*") println("fullClasspath:"+fullClasspath) println("assembly:"+assembly) println("jars:"+jars.map(_.getAbsolutePath()).mkString(",")) // On Mon, Nov 16, 2015 at 4:51 PM, Jeff Zhang wrote: > This is the exception I got > > 15/11/16 16:50:48 WARN metastore.HiveMetaStore: Retrying creating default > database after error: Class > org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found. > javax.jdo.JDOFatalUserException: Class > org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found. > at > javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175) > at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) > at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394) > at > org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291) > at > org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258) > at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:66) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72) > at > org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:199) > at > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86) > > On 
Mon, Nov 16, 2015 at 4:47 PM, Jeff Zhang wrote: > >> It's about the datanucleus related jars which is needed by spark sql. >> Without these jars, I could not call data frame related api ( I make >> HiveContext enabled) >> >> >> >> On Mon, Nov 16, 2015 at 4:10 PM, Josh Rosen >> wrote: >> >>> As of https://github.com/apache/spark/pull/9575, Spark's build will no >>> longer place every dependency JAR into lib_managed. Can you say more about >>> how this affected spark-shell for you (maybe share a stacktrace)? >>> >>> On Mon, Nov 16, 2015 at 12:03 AM, Jeff Zhang wrote: >>> >>>> >>>> Sometimes, the jars under lib_managed is missing. And after I rebuild >>>> the spark, the jars under lib_managed is still not downloaded. This would >>>> cause the spark-shell fail due to jars missing. Anyone has hit this weird >>>> issue ? >>>> >>>> >>>> >>>> -- >>>> Best Regards >>>> >>>> Jeff Zhang >>>> >>> >>> >> >> >> -- >> Best Regards >> >> Jeff Zhang >> > > > > -- > Best Regards > > Jeff Zhang > -- Best Regards Jeff Zhang
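For readers unfamiliar with the task being debugged above, a rough sbt sketch of what such a task might look like is below. The task name mirrors the snippet in the thread, but the copy step, the destination directory, and the use of the compile classpath (to keep the sketch free of the sbt-assembly plugin) are assumptions rather than Spark's actual build code:

import sbt._
import sbt.Keys._

object DeployJarsSketch {
  // Hypothetical task key echoing the deployDatanucleusJars task quoted above.
  val deployDatanucleusJars = taskKey[Unit]("Copy DataNucleus jars into lib_managed/jars")

  lazy val settings = Seq(
    deployDatanucleusJars := {
      val jars: Seq[File] = (fullClasspath in Compile).value.map(_.data)
        .filter(_.getPath.contains("org.datanucleus"))
      val destDir = file("lib_managed") / "jars"
      IO.createDirectory(destDir)
      jars.foreach(jar => IO.copyFile(jar, destDir / jar.getName))
    }
  )
}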
Re: slightly more informative error message in MLUtils.loadLibSVMFile
+1 On Tue, Nov 17, 2015 at 7:43 AM, Joseph Bradley wrote: > That sounds useful; would you mind submitting a JIRA (and a PR if you're > willing)? > Thanks, > Joseph > > On Fri, Oct 23, 2015 at 12:43 PM, Robert Dodier > wrote: > >> Hi, >> >> MLUtils.loadLibSVMFile verifies that indices are 1-based and >> increasing, and otherwise triggers an error. I'd like to suggest that >> the error message be a little more informative. I ran into this when >> loading a malformed file. Exactly what gets printed isn't too crucial, >> maybe you would want to print something else, all that matters is to >> give some context so that the user can find the problem more quickly. >> >> Hope this helps in some way. >> >> Robert Dodier >> >> PS. >> >> diff --git >> a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala >> b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala >> index 81c2f0c..6f5f680 100644 >> --- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala >> +++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala >> @@ -91,7 +91,7 @@ object MLUtils { >> val indicesLength = indices.length >> while (i < indicesLength) { >>val current = indices(i) >> - require(current > previous, "indices should be one-based >> and in ascending order" ) >> + require(current > previous, "indices should be one-based >> and in ascending order; found current=" + current + ", previous=" + >> previous + "; line=\"" + line + "\"" ) >> previous = current >>i += 1 >> } >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> For additional commands, e-mail: dev-h...@spark.apache.org >> >> > -- Best Regards Jeff Zhang
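A small self-contained sketch of the invariant being checked here (one-based, strictly ascending LIBSVM feature indices) together with the more informative message; the sample line is made up:

// A malformed LIBSVM line: index 7 appears after index 10.
val line = "1 3:0.5 10:1.2 7:0.3"
val indices = line.split(' ').tail.map(_.split(':')(0).toInt)
indices.foldLeft(0) { (previous, current) =>
  require(current > previous,
    s"indices should be one-based and in ascending order; found current=$current, previous=$previous in line: $line")
  current
}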
Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?
This is the exception I got 15/11/16 16:50:48 WARN metastore.HiveMetaStore: Retrying creating default database after error: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found. javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found. at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365) at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394) at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291) at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57) at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:66) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72) at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:199) at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86) On Mon, Nov 16, 2015 at 4:47 PM, Jeff Zhang wrote: > It's about the datanucleus related jars which is needed by spark sql. > Without these jars, I could not call data frame related api ( I make > HiveContext enabled) > > > > On Mon, Nov 16, 2015 at 4:10 PM, Josh Rosen > wrote: > >> As of https://github.com/apache/spark/pull/9575, Spark's build will no >> longer place every dependency JAR into lib_managed. Can you say more about >> how this affected spark-shell for you (maybe share a stacktrace)? >> >> On Mon, Nov 16, 2015 at 12:03 AM, Jeff Zhang wrote: >> >>> >>> Sometimes, the jars under lib_managed is missing. And after I rebuild >>> the spark, the jars under lib_managed is still not downloaded. This would >>> cause the spark-shell fail due to jars missing. Anyone has hit this weird >>> issue ? >>> >>> >>> >>> -- >>> Best Regards >>> >>> Jeff Zhang >>> >> >> > > > -- > Best Regards > > Jeff Zhang > -- Best Regards Jeff Zhang
Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?
It's about the DataNucleus-related jars, which are needed by Spark SQL. Without these jars, I could not call the DataFrame-related API (I have HiveContext enabled). On Mon, Nov 16, 2015 at 4:10 PM, Josh Rosen wrote: > As of https://github.com/apache/spark/pull/9575, Spark's build will no > longer place every dependency JAR into lib_managed. Can you say more about > how this affected spark-shell for you (maybe share a stacktrace)? > > On Mon, Nov 16, 2015 at 12:03 AM, Jeff Zhang wrote: > >> >> Sometimes, the jars under lib_managed is missing. And after I rebuild the >> spark, the jars under lib_managed is still not downloaded. This would cause >> the spark-shell fail due to jars missing. Anyone has hit this weird issue ? >> >> >> >> -- >> Best Regards >> >> Jeff Zhang >> > > -- Best Regards Jeff Zhang
Does anyone meet the issue that jars under lib_managed is never downloaded ?
Sometimes the jars under lib_managed are missing, and even after I rebuild Spark, the jars under lib_managed are still not downloaded. This causes spark-shell to fail due to the missing jars. Has anyone hit this weird issue ? -- Best Regards Jeff Zhang
Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?
Didn't notice that I can pass comma separated path in the existing API (SparkContext#textFile). So no necessary for new api. Thanks all. On Thu, Nov 12, 2015 at 10:24 AM, Jeff Zhang wrote: > Hi Pradeep > > ≥≥≥ Looks like what I was suggesting doesn't work. :/ > I guess you mean put comma separated path into one string and pass it > to existing API (SparkContext#textFile). It should not work. I suggest to > create new api SparkContext#textFiles to accept an array of string. I have > already implemented a simple patch and it works. > > > > > On Thu, Nov 12, 2015 at 10:17 AM, Pradeep Gollakota > wrote: > >> Looks like what I was suggesting doesn't work. :/ >> >> On Wed, Nov 11, 2015 at 4:49 PM, Jeff Zhang wrote: >> >>> Yes, that's what I suggest. TextInputFormat support multiple inputs. So >>> in spark side, we just need to provide API to for that. >>> >>> On Thu, Nov 12, 2015 at 8:45 AM, Pradeep Gollakota >> > wrote: >>> >>>> IIRC, TextInputFormat supports an input path that is a comma separated >>>> list. I haven't tried this, but I think you should just be able to do >>>> sc.textFile("file1,file2,...") >>>> >>>> On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang wrote: >>>> >>>>> I know these workaround, but wouldn't it be more convenient and >>>>> straightforward to use SparkContext#textFiles ? >>>>> >>>>> On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra >>>> > wrote: >>>>> >>>>>> For more than a small number of files, you'd be better off using >>>>>> SparkContext#union instead of RDD#union. That will avoid building up a >>>>>> lengthy lineage. >>>>>> >>>>>> On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky >>>>>> wrote: >>>>>> >>>>>>> Hey Jeff, >>>>>>> Do you mean reading from multiple text files? In that case, as a >>>>>>> workaround, you can use the RDD#union() (or ++) method to concatenate >>>>>>> multiple rdds. For example: >>>>>>> >>>>>>> val lines1 = sc.textFile("file1") >>>>>>> val lines2 = sc.textFile("file2") >>>>>>> >>>>>>> val rdd = lines1 union lines2 >>>>>>> >>>>>>> regards, >>>>>>> --Jakob >>>>>>> >>>>>>> On 11 November 2015 at 01:20, Jeff Zhang wrote: >>>>>>> >>>>>>>> Although user can use the hdfs glob syntax to support multiple >>>>>>>> inputs. But sometimes, it is not convenient to do that. Not sure why >>>>>>>> there's no api of SparkContext#textFiles. It should be easy to >>>>>>>> implement >>>>>>>> that. I'd love to create a ticket and contribute for that if there's no >>>>>>>> other consideration that I don't know. >>>>>>>> >>>>>>>> -- >>>>>>>> Best Regards >>>>>>>> >>>>>>>> Jeff Zhang >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Best Regards >>>>> >>>>> Jeff Zhang >>>>> >>>> >>>> >>> >>> >>> -- >>> Best Regards >>> >>> Jeff Zhang >>> >> >> > > > -- > Best Regards > > Jeff Zhang > -- Best Regards Jeff Zhang
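A minimal sketch of the two options mentioned in this thread, a comma-separated path list for SparkContext#textFile versus SparkContext#union; the file names are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("multi-input"))
// textFile accepts a comma-separated list of paths (globs work too).
val lines = sc.textFile("/data/file1.txt,/data/file2.txt,/data/2015-*/part-*")
// Equivalent workaround discussed in the thread: union of individually loaded RDDs.
val lines2 = sc.union(Seq(sc.textFile("/data/file1.txt"), sc.textFile("/data/file2.txt")))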
Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?
Hi Pradeep ≥≥≥ Looks like what I was suggesting doesn't work. :/ I guess you mean put comma separated path into one string and pass it to existing API (SparkContext#textFile). It should not work. I suggest to create new api SparkContext#textFiles to accept an array of string. I have already implemented a simple patch and it works. On Thu, Nov 12, 2015 at 10:17 AM, Pradeep Gollakota wrote: > Looks like what I was suggesting doesn't work. :/ > > On Wed, Nov 11, 2015 at 4:49 PM, Jeff Zhang wrote: > >> Yes, that's what I suggest. TextInputFormat support multiple inputs. So >> in spark side, we just need to provide API to for that. >> >> On Thu, Nov 12, 2015 at 8:45 AM, Pradeep Gollakota >> wrote: >> >>> IIRC, TextInputFormat supports an input path that is a comma separated >>> list. I haven't tried this, but I think you should just be able to do >>> sc.textFile("file1,file2,...") >>> >>> On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang wrote: >>> >>>> I know these workaround, but wouldn't it be more convenient and >>>> straightforward to use SparkContext#textFiles ? >>>> >>>> On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra >>>> wrote: >>>> >>>>> For more than a small number of files, you'd be better off using >>>>> SparkContext#union instead of RDD#union. That will avoid building up a >>>>> lengthy lineage. >>>>> >>>>> On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky >>>>> wrote: >>>>> >>>>>> Hey Jeff, >>>>>> Do you mean reading from multiple text files? In that case, as a >>>>>> workaround, you can use the RDD#union() (or ++) method to concatenate >>>>>> multiple rdds. For example: >>>>>> >>>>>> val lines1 = sc.textFile("file1") >>>>>> val lines2 = sc.textFile("file2") >>>>>> >>>>>> val rdd = lines1 union lines2 >>>>>> >>>>>> regards, >>>>>> --Jakob >>>>>> >>>>>> On 11 November 2015 at 01:20, Jeff Zhang wrote: >>>>>> >>>>>>> Although user can use the hdfs glob syntax to support multiple >>>>>>> inputs. But sometimes, it is not convenient to do that. Not sure why >>>>>>> there's no api of SparkContext#textFiles. It should be easy to implement >>>>>>> that. I'd love to create a ticket and contribute for that if there's no >>>>>>> other consideration that I don't know. >>>>>>> >>>>>>> -- >>>>>>> Best Regards >>>>>>> >>>>>>> Jeff Zhang >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>>> >>>> -- >>>> Best Regards >>>> >>>> Jeff Zhang >>>> >>> >>> >> >> >> -- >> Best Regards >> >> Jeff Zhang >> > > -- Best Regards Jeff Zhang
Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?
Yes, that's what I suggest. TextInputFormat support multiple inputs. So in spark side, we just need to provide API to for that. On Thu, Nov 12, 2015 at 8:45 AM, Pradeep Gollakota wrote: > IIRC, TextInputFormat supports an input path that is a comma separated > list. I haven't tried this, but I think you should just be able to do > sc.textFile("file1,file2,...") > > On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang wrote: > >> I know these workaround, but wouldn't it be more convenient and >> straightforward to use SparkContext#textFiles ? >> >> On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra >> wrote: >> >>> For more than a small number of files, you'd be better off using >>> SparkContext#union instead of RDD#union. That will avoid building up a >>> lengthy lineage. >>> >>> On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky >>> wrote: >>> >>>> Hey Jeff, >>>> Do you mean reading from multiple text files? In that case, as a >>>> workaround, you can use the RDD#union() (or ++) method to concatenate >>>> multiple rdds. For example: >>>> >>>> val lines1 = sc.textFile("file1") >>>> val lines2 = sc.textFile("file2") >>>> >>>> val rdd = lines1 union lines2 >>>> >>>> regards, >>>> --Jakob >>>> >>>> On 11 November 2015 at 01:20, Jeff Zhang wrote: >>>> >>>>> Although user can use the hdfs glob syntax to support multiple inputs. >>>>> But sometimes, it is not convenient to do that. Not sure why there's no >>>>> api >>>>> of SparkContext#textFiles. It should be easy to implement that. I'd love >>>>> to >>>>> create a ticket and contribute for that if there's no other consideration >>>>> that I don't know. >>>>> >>>>> -- >>>>> Best Regards >>>>> >>>>> Jeff Zhang >>>>> >>>> >>>> >>> >> >> >> -- >> Best Regards >> >> Jeff Zhang >> > > -- Best Regards Jeff Zhang
Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?
I know these workarounds, but wouldn't it be more convenient and straightforward to use SparkContext#textFiles? On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra wrote: > For more than a small number of files, you'd be better off using > SparkContext#union instead of RDD#union. That will avoid building up a > lengthy lineage. > > On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky > wrote: > >> Hey Jeff, >> Do you mean reading from multiple text files? In that case, as a >> workaround, you can use the RDD#union() (or ++) method to concatenate >> multiple rdds. For example: >> >> val lines1 = sc.textFile("file1") >> val lines2 = sc.textFile("file2") >> >> val rdd = lines1 union lines2 >> >> regards, >> --Jakob >> >> On 11 November 2015 at 01:20, Jeff Zhang wrote: >> >>> Although user can use the hdfs glob syntax to support multiple inputs. >>> But sometimes, it is not convenient to do that. Not sure why there's no api >>> of SparkContext#textFiles. It should be easy to implement that. I'd love to >>> create a ticket and contribute for that if there's no other consideration >>> that I don't know. >>> >>> -- >>> Best Regards >>> >>> Jeff Zhang >>> >> >> > -- Best Regards Jeff Zhang
Why there's no api for SparkContext#textFiles to support multiple inputs ?
Although users can use the HDFS glob syntax to specify multiple inputs, sometimes it is not convenient to do that. I'm not sure why there's no SparkContext#textFiles API; it should be easy to implement. I'd love to create a ticket and contribute a patch for it if there's no other consideration that I'm not aware of. -- Best Regards Jeff Zhang
Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?
Yes Kai, I also plan to do this for CsvRelation and will create a PR for spark-csv. On Wed, Nov 11, 2015 at 9:10 AM, Sasaki Kai wrote: > Did you indicate CsvRelation in spark-csv package? LibSVMRelation is > included in spark core package, but CsvRelation(spark-csv) is not. > Is it necessary for us to modify also spark-csv as you proposed in > SPARK-11622? > > Regards > > Kai > > > On Nov 5, 2015, at 11:30 AM, Jeff Zhang wrote: > > > > > > Not sure the reason, it seems LibSVMRelation and CsvRelation can > extends HadoopFsRelation and leverage the features from HadoopFsRelation. > Any other consideration for that ? > > > > > > -- > > Best Regards > > > > Jeff Zhang > > -- Best Regards Jeff Zhang
Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?
Thanks Hao. I have already made it extend HadoopFsRelation and it works; I will create a JIRA for that. Besides that, I noticed that in DataSourceStrategy, Spark builds the physical plan by pattern matching on the trait of the BaseRelation (e.g. CatalystScan, TableScan, HadoopFsRelation), which means the order of the cases matters. I think this is risky, because it implies one BaseRelation can't safely extend two or more of these traits, yet there seems to be nothing that prevents it. Maybe these traits need to be cleaned up and reorganized, otherwise users may hit some weird issues when developing a new DataSource. On Thu, Nov 5, 2015 at 1:16 PM, Cheng, Hao wrote: > Probably 2 reasons: > > 1. HadoopFsRelation was introduced since 1.4, but seems CsvRelation > was created based on 1.3 > > 2. HadoopFsRelation introduces the concept of Partition, which > probably not necessary for LibSVMRelation. > > > > But I think it will be easy to change as extending from HadoopFsRelation. > > > > Hao > > > > *From:* Jeff Zhang [mailto:zjf...@gmail.com] > *Sent:* Thursday, November 5, 2015 10:31 AM > *To:* dev@spark.apache.org > *Subject:* Why LibSVMRelation and CsvRelation don't extends > HadoopFsRelation ? > > > > > > Not sure the reason, it seems LibSVMRelation and CsvRelation can extends > HadoopFsRelation and leverage the features from HadoopFsRelation. Any > other consideration for that ? > > > > > > -- > > Best Regards > > Jeff Zhang > -- Best Regards Jeff Zhang
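To illustrate the ordering concern with a simplified, self-contained sketch: the trait and relation names below only mimic the real planner interfaces, this is not Spark's actual DataSourceStrategy code.

// Two marker traits standing in for planner interfaces such as TableScan
// and CatalystScan.
trait TableScanLike
trait CatalystScanLike

// A relation that (perhaps unintentionally) mixes in both traits.
class MixedRelation extends TableScanLike with CatalystScanLike

// Pattern matching picks the first case that matches, so the plan chosen
// for MixedRelation depends entirely on the order of the cases.
def plan(relation: AnyRef): String = relation match {
  case _: CatalystScanLike => "planned via the CatalystScan path"
  case _: TableScanLike    => "planned via the TableScan path"
}

// plan(new MixedRelation) returns "planned via the CatalystScan path";
// swapping the two cases silently changes the result.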
Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?
I'm not sure of the reason, but it seems LibSVMRelation and CsvRelation could extend HadoopFsRelation and leverage its features. Is there any other consideration behind that? -- Best Regards Jeff Zhang
Re: Master build fails ?
I found it is due to SPARK-11073. Here's the command I used to build build/sbt clean compile -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Psparkr On Tue, Nov 3, 2015 at 7:52 PM, Jean-Baptiste Onofré wrote: > Hi Jeff, > > it works for me (with skipping the tests). > > Let me try again, just to be sure. > > Regards > JB > > > On 11/03/2015 11:50 AM, Jeff Zhang wrote: > >> Looks like it's due to guava version conflicts, I see both guava 14.0.1 >> and 16.0.1 under lib_managed/bundles. Anyone meet this issue too ? >> >> [error] >> >> /Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:26: >> object HashCodes is not a member of package com.google.common.hash >> [error] import com.google.common.hash.HashCodes >> [error]^ >> [info] Resolving org.apache.commons#commons-math;2.2 ... >> [error] >> >> /Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:384: >> not found: value HashCodes >> [error] val cookie = HashCodes.fromBytes(secret).toString() >> [error] ^ >> >> >> >> >> -- >> Best Regards >> >> Jeff Zhang >> > > -- > Jean-Baptiste Onofré > jbono...@apache.org > http://blog.nanthrax.net > Talend - http://www.talend.com > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > > -- Best Regards Jeff Zhang
Master build fails ?
Looks like it's due to Guava version conflicts; I see both guava 14.0.1 and 16.0.1 under lib_managed/bundles. Has anyone else met this issue? [error] /Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:26: object HashCodes is not a member of package com.google.common.hash [error] import com.google.common.hash.HashCodes [error]^ [info] Resolving org.apache.commons#commons-math;2.2 ... [error] /Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:384: not found: value HashCodes [error] val cookie = HashCodes.fromBytes(secret).toString() [error] ^ -- Best Regards Jeff Zhang
Should enforce the uniqueness of field name in DataFrame ?
Currently it seems DataFrame doesn't enforce the uniqueness of field names, so it is possible to have duplicate fields in a DataFrame. This usually happens after a join, especially a self-join. Users can rename the columns before the join, or rename them after the join, but DataFrame#withColumnRenamed is not sufficient for that right now. In Hive, the ambiguous name can be resolved by using the table name as a prefix, but it seems the DataFrame API doesn't support that (I mean the DataFrame API rather than Spark SQL). I think we have 2 options here: 1. Enforce the uniqueness of field names in DataFrame, so that operations like the following would not cause an ambiguous column reference. 2. Provide DataFrame#withColumnsRenamed(oldColumns: Seq[String], newColumns: Seq[String]) to allow changing schema names. For now, I would prefer option 2, which is easier to implement and keeps compatibility. val df = ... // schema (name, age) val df2 = df.join(df, "name") // schema (name, age, age) df2.select("age") // ambiguous column reference -- Best Regards Jeff Zhang
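As a possible workaround with the existing API, aliasing each side of the self-join lets the duplicated column be referenced unambiguously through its alias. A sketch, assuming the (name, age) schema from the example above:

import org.apache.spark.sql.functions.col

// Alias each side of the self-join so the duplicated columns can be
// told apart by prefix, similar to using table names in Hive.
val left = df.as("l")
val right = df.as("r")
val joined = left.join(right, col("l.name") === col("r.name"))

// No longer ambiguous: pick the copy of "age" you want via its alias.
joined.select(col("l.age"))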
Re: Is OutputCommitCoordinator necessary for all the stages ?
Hi Josh, I mean on the driver side. OutputCommitCoordinator.startStage is called in DAGScheduler#submitMissingTasks for all the stages (which costs some memory). Although it is fine in the sense that, as long as the executor side doesn't issue the RPC, there isn't much of a performance penalty. On Wed, Aug 12, 2015 at 12:17 AM, Josh Rosen wrote: > Can you clarify what you mean by "used for all stages"? > OutputCommitCoordinator RPCs should only be initiated through > SparkHadoopMapRedUtil.commitTask(), so while the OutputCommitCoordinator > doesn't make a distinction between ShuffleMapStages and ResultStages there > still should not be a performance penalty for this because the extra rounds > of RPCs should only be performed when necessary. > > > On 8/11/15 2:25 AM, Jeff Zhang wrote: > >> As my understanding, OutputCommitCoordinator should only be necessary for >> ResultStage (especially for ResultStage with hdfs write), but currently it >> is used for all the stages. Is there any reason for that ? >> >> -- >> Best Regards >> >> Jeff Zhang >> > > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > > -- Best Regards Jeff Zhang
Is OutputCommitCoordinator necessary for all the stages ?
As I understand it, OutputCommitCoordinator should only be necessary for a ResultStage (especially a ResultStage with an HDFS write), but currently it is used for all the stages. Is there any reason for that? -- Best Regards Jeff Zhang