Re: Why spark-submit works with package not with jar

2024-05-05 Thread Jeff Zhang
t convinced why using the package should make so much > difference between a failure and success. In other words, when to use a > package rather than a jar. > > > Any ideas will be appreciated. > > > Thanks > > > -- Best Regards Jeff Zhang
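A rough sketch of the distinction being asked about (the Maven coordinate and jar path below are only illustrative): --packages / spark.jars.packages takes Maven coordinates and resolves the artifact together with its transitive dependencies, while --jars / spark.jars only ships the exact files listed, so a jar whose own dependencies are missing from the classpath can fail where the package succeeds.

    // Hedged sketch: the same dependency supplied the two ways, via SparkConf
    // (equivalent to the --packages and --jars flags of spark-submit).
    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .set("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0") // resolved from Maven, incl. transitive deps
      .set("spark.jars", "/local/path/extra.jar")                                      // shipped as-is, no dependency resolution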

Re: Welcoming some new committers and PMC members

2019-09-09 Thread Jeff Zhang
>> >>>> >>>> -- >>>> Shane Knapp >>>> UC Berkeley EECS Research / RISELab Staff Technical Lead >>>> https://rise.cs.berkeley.edu >>>> >>>> - >>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>> >>>> >>> >>> -- >>> John Zhuge >>> >> >> >> -- >> Name : Jungtaek Lim >> Blog : http://medium.com/@heartsavior >> Twitter : http://twitter.com/heartsavior >> LinkedIn : http://www.linkedin.com/in/heartsavior >> > -- Best Regards Jeff Zhang

Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-15 Thread Jeff Zhang
bers for contributing to > this release. This release would not have been possible without you. > > Bests, > Dongjoon. > -- Best Regards Jeff Zhang

Re: Kubernetes backend and docker images

2018-01-05 Thread Jeff Zhang
Awesome, less is better. Mridul Muralidharan wrote on Sat, Jan 6, 2018 at 11:54 AM: > > We should definitely clean this up and make it the default, nicely done > Marcelo ! > > Thanks, > Mridul > > On Fri, Jan 5, 2018 at 5:06 PM Marcelo Vanzin wrote: > >> Hey all,

Re: Faster Spark on ORC with Apache ORC

2017-07-13 Thread Jeff Zhang
Awesome, Dong Joon, it's a great improvement. Looking forward to its merge. Dong Joon Hyun wrote on Wed, Jul 12, 2017 at 6:53 AM: > Hi, All. > > > > Since Apache Spark 2.2 vote passed successfully last week, > > I think it’s a good time for me to ask your opinions again about the >

Re: [Important for PySpark Devs]: Master now tests with Python 2.7 rather than 2.6 - please retest any Python PRs

2017-03-31 Thread Jeff Zhang
Thanks, retriggered several pyspark PRs. Hyukjin Kwon wrote on Thu, Mar 30, 2017 at 7:42 AM: > Thank you for informing this. > > On 30 Mar 2017 3:52 a.m., "Holden Karau" wrote: > > Hi PySpark Developers, > > In https://issues.apache.org/jira/browse/SPARK-19955 / >

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Jeff Zhang
Congratulations Burak and Holden! Yanbo Liang wrote on Wed, Jan 25, 2017 at 11:54 AM: > Congratulations, Burak and Holden. > > On Tue, Jan 24, 2017 at 7:32 PM, Chester Chen > wrote: > > Congratulation to both. > > > > Holden, we need catch up. > > > > > > *Chester

Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-03 Thread Jeff Zhang
+1 Dongjoon Hyun wrote on Fri, Nov 4, 2016 at 9:44 AM: > +1 (non-binding) > > It's built and tested on CentOS 6.8 / OpenJDK 1.8.0_111, too. > > Cheers, > Dongjoon. > > On 2016-11-03 14:30 (-0700), Davies Liu wrote: > > +1 > > > > On Wed, Nov 2, 2016 at 5:40 PM,

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Jeff Zhang
>>>>> taking an existing Spark workload and running on this release candidate, >>>>> then reporting any regressions from 2.0.0. >>>>> >>>>> Q: What justifies a -1 vote for this release? >>>>> A: This is a maintenance release in the 2.0.x series. Bugs already >>>>> present in 2.0.0, missing features, or bugs related to new features will >>>>> not necessarily block this release. >>>>> >>>>> Q: What fix version should I use for patches merging into branch-2.0 >>>>> from now on? >>>>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new >>>>> RC (i.e. RC5) is cut, I will change the fix version of those patches to >>>>> 2.0.1. >>>>> >>>>> >>>>> >>>> >>>> >>>> -- >>>> Luciano Resende >>>> http://twitter.com/lresende1975 >>>> http://lresende.blogspot.com/ >>>> >>> >>> >> >> >> -- >> Kyle Kelley (@rgbkrk <https://twitter.com/rgbkrk>; lambdaops.com) >> > -- Best Regards Jeff Zhang

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Jeff Zhang
res, digests, etc. can be found >>>>>>>> at: >>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0. >>>>>>>> 1-rc3-bin/ >>>>>>>> >>>>>>>> Release artifacts are signed with the following key: >>>>>>>> https://people.apache.org/keys/committer/pwendell.asc >>>>>>>> >>>>>>>> The staging repository for this release can be found at: >>>>>>>> https://repository.apache.org/content/repositories/orgapache >>>>>>>> spark-1201/ >>>>>>>> >>>>>>>> The documentation corresponding to this release can be found at: >>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0. >>>>>>>> 1-rc3-docs/ >>>>>>>> >>>>>>>> >>>>>>>> Q: How can I help test this release? >>>>>>>> A: If you are a Spark user, you can help us test this release by >>>>>>>> taking an existing Spark workload and running on this release >>>>>>>> candidate, >>>>>>>> then reporting any regressions from 2.0.0. >>>>>>>> >>>>>>>> Q: What justifies a -1 vote for this release? >>>>>>>> A: This is a maintenance release in the 2.0.x series. Bugs already >>>>>>>> present in 2.0.0, missing features, or bugs related to new features >>>>>>>> will >>>>>>>> not necessarily block this release. >>>>>>>> >>>>>>>> Q: What fix version should I use for patches merging into >>>>>>>> branch-2.0 from now on? >>>>>>>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a >>>>>>>> new RC (i.e. RC4) is cut, I will change the fix version of those >>>>>>>> patches to >>>>>>>> 2.0.1. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>> >>>> >>> >> > -- Best Regards Jeff Zhang

Re: Possible contribution to MLlib

2016-06-21 Thread Jeff Zhang
s, but > it has not been added to MLlib. Therefore, we are wondering if such an > extension of MLlib K-means algorithm would be appreciated by the community > and would have chances to get included in future spark releases. > > > > Regards, > > > > Simon Nanty > > > -- Best Regards Jeff Zhang

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Jeff Zhang
>>>> [INFO] Finished at: 2016-05-18T17:55:33+02:00 >>>>>>> [INFO] Final Memory: 90M/824M >>>>>>> [INFO] >>>>>>> >>>>>>> >>>>>>> On 18 May 2016, at 16:28, Sean Owen <so...@cloudera.com> wrote: >>>>>>> >>>>>>> I think it's a good idea. Although releases have been preceded before >>>>>>> by release candidates for developers, it would be good to get a >>>>>>> formal >>>>>>> preview/beta release ratified for public consumption ahead of a new >>>>>>> major release. Better to have a little more testing in the wild to >>>>>>> identify problems before 2.0.0 is finalized. >>>>>>> >>>>>>> +1 to the release. License, sigs, etc check out. On Ubuntu 16 + Java >>>>>>> 8, compilation and tests succeed for "-Pyarn -Phive >>>>>>> -Phive-thriftserver -Phadoop-2.6". >>>>>>> >>>>>>> On Wed, May 18, 2016 at 6:40 AM, Reynold Xin <r...@apache.org> >>>>>>> wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> In the past the Apache Spark community have created preview packages >>>>>>> (not >>>>>>> official releases) and used those as opportunities to ask community >>>>>>> members >>>>>>> to test the upcoming versions of Apache Spark. Several people in the >>>>>>> Apache >>>>>>> community have suggested we conduct votes for these preview packages >>>>>>> and >>>>>>> turn them into formal releases by the Apache foundation's standard. >>>>>>> Preview >>>>>>> releases are not meant to be functional, i.e. they can and highly >>>>>>> likely >>>>>>> will contain critical bugs or documentation errors, but we will be >>>>>>> able to >>>>>>> post them to the project's website to get wider feedback. They should >>>>>>> satisfy the legal requirements of Apache's release policy >>>>>>> (http://www.apache.org/dev/release.html) such as having proper >>>>>>> licenses. >>>>>>> >>>>>>> >>>>>>> Please vote on releasing the following candidate as Apache Spark >>>>>>> version >>>>>>> 2.0.0-preview. The vote is open until Friday, May 20, 2015 at 11:00 >>>>>>> PM PDT >>>>>>> and passes if a majority of at least 3 +1 PMC votes are cast. >>>>>>> >>>>>>> [ ] +1 Release this package as Apache Spark 2.0.0-preview >>>>>>> [ ] -1 Do not release this package because ... >>>>>>> >>>>>>> To learn more about Apache Spark, please see >>>>>>> http://spark.apache.org/ >>>>>>> >>>>>>> The tag to be voted on is 2.0.0-preview >>>>>>> (8f5a04b6299e3a47aca13cbb40e72344c0114860) >>>>>>> >>>>>>> The release files, including signatures, digests, etc. can be found >>>>>>> at: >>>>>>> >>>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-bin/ >>>>>>> >>>>>>> Release artifacts are signed with the following key: >>>>>>> https://people.apache.org/keys/committer/pwendell.asc >>>>>>> >>>>>>> The documentation corresponding to this release can be found at: >>>>>>> >>>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/ >>>>>>> >>>>>>> The list of resolved issues are: >>>>>>> >>>>>>> https://issues.apache.org/jira/browse/SPARK-15351?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.0.0 >>>>>>> >>>>>>> >>>>>>> If you are a Spark user, you can help us test this release by taking >>>>>>> an >>>>>>> existing Apache Spark workload and running on this candidate, then >>>>>>> reporting >>>>>>> any regressions. >>>>>>> >>>>>>> >>>>>>> - >>>>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >>>>>>> For additional commands, e-mail: dev-h...@spark.apache.org >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >> > -- Best Regards Jeff Zhang

Re: executor delay in Spark

2016-04-24 Thread Jeff Zhang
> >>>>> >> If the data file is same then it should have similar distribution of >>>>> >> keys. >>>>> >> Few queries- >>>>> >> >>>>> >> 1. Did you compare the number of partitions in both the cases? >>>>> >> 2. Did you compare the resource allocation for Spark Shell vs Scala >>>>> >> Program being submitted? >>>>> >> >>>>> >> Also, can you please share the details of Spark Context, >>>>> Environment and >>>>> >> Executors when you run via Scala program? >>>>> >> >>>>> >> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju < >>>>> >> m.vijayaragh...@gmail.com> wrote: >>>>> >> >>>>> >>> Hello All, >>>>> >>> >>>>> >>> We are using HashPartitioner in the following way on a 3 node >>>>> cluster (1 >>>>> >>> master and 2 worker nodes). >>>>> >>> >>>>> >>> val u = >>>>> >>> sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int, >>>>> >>> Int)](line => { line.split("\\|") match { case Array(x, y) => >>>>> (y.toInt, >>>>> >>> x.toInt) } }).partitionBy(new >>>>> HashPartitioner(8)).setName("u").persist() >>>>> >>> >>>>> >>> u.count() >>>>> >>> >>>>> >>> If we run this from the spark shell, the data (52 MB) is split >>>>> across >>>>> >>> the >>>>> >>> two worker nodes. But if we put this in a scala program and run >>>>> it, then >>>>> >>> all the data goes to only one node. We have run it multiple times, >>>>> but >>>>> >>> this >>>>> >>> behavior does not change. This seems strange. >>>>> >>> >>>>> >>> Is there some problem with the way we use HashPartitioner? >>>>> >>> >>>>> >>> Thanks in advance. >>>>> >>> >>>>> >>> Regards, >>>>> >>> Raghava. >>>>> >>> >>>>> >> >>>>> >> >>>>> > >>>>> > >>>>> > -- >>>>> > Regards, >>>>> > Raghava >>>>> > http://raghavam.github.io >>>>> > >>>>> >>>>> >>>>> -- >>>>> Thanks, >>>>> Mike >>>>> >>>> >>>> >>>> >>>> -- >>>> Regards, >>>> Raghava >>>> http://raghavam.github.io >>>> >>> >> >> >> -- >> Regards, >> Raghava >> http://raghavam.github.io >> > -- Best Regards Jeff Zhang
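For the questions raised in this thread (partition counts and key distribution), a small check like the following can be run from both spark-shell and the submitted program; the HDFS path is the one quoted above, and the helper itself is only an illustration.

    // Count the records that land in each partition after the HashPartitioner,
    // to compare the spark-shell run against the submitted Scala program.
    import org.apache.spark.HashPartitioner
    val u = sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt")
      .map(line => line.split("\\|") match { case Array(x, y) => (y.toInt, x.toInt) })
      .partitionBy(new HashPartitioner(8)).setName("u").persist()
    println("partitions: " + u.partitions.length)
    u.mapPartitionsWithIndex((i, it) => Iterator((i, it.size)), true)
      .collect()
      .foreach { case (i, n) => println(s"partition $i -> $n records") }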

Spark build with scala-2.10 fails ?

2016-03-19 Thread Jeff Zhang
]^ [error] four errors found [error] Compile failed at Mar 17, 2016 2:45:22 PM [13.105s] -- Best Regards Jeff Zhang

Re: What should be spark.local.dir in spark on yarn?

2016-03-01 Thread Jeff Zhang
/data0X/yarn/nm used for usercache > > 16/03/01 08:41:12 INFO storage.DiskBlockManager: Created local directory at > /data01/yarn/nm/usercache/hadoop/appcache/application_1456776184284_0047/blockmgr-af5 > > > > On Mon, Feb 29, 2016 at 3:44 PM, Jeff Zhang <zjf...@gmail.com&

Re: Is spark.driver.maxResultSize used correctly ?

2016-03-01 Thread Jeff Zhang
the driver side? > > > On Sunday, February 28, 2016, Jeff Zhang <zjf...@gmail.com> wrote: > >> data skew might be possible, but not the common case. I think we should >> design for the common case, for the skew case, we may can set some >> parameter of fraction to allow

Re: Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang
you can already do what you proposed by creating > identical virtualenvs on all nodes on the same path and change the spark > python path to point to the virtualenv. > > Best Regards, > Mohannad > On Mar 1, 2016 06:07, "Jeff Zhang" <zjf...@gmail.com> wrote: >
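A minimal sketch of the workaround described in this reply, not of the proposed feature (the virtualenv path is an assumption, and how PYSPARK_PYTHON has to be supplied can vary with the Spark version and deploy mode): the virtualenv is created at the same path on every node, and the executors' Python is pointed at it.

    // Point executor Python at a pre-created, identical virtualenv on every node.
    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .set("spark.executorEnv.PYSPARK_PYTHON", "/opt/venvs/myapp/bin/python")
    // Commonly the same path is also exported as PYSPARK_PYTHON in the environment
    // that launches spark-submit, so the driver side matches.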

Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang
.virtualenv.path (path to the executable for virtualenv/conda) Best Regards Jeff Zhang

Re: What should be spark.local.dir in spark on yarn?

2016-02-29 Thread Jeff Zhang
park.local.dir to be /data01/tmp,/data02/tmp > > But spark master also writes some files to spark.local.dir > But my master box has only one additional disk /data01 > > So, what should I use for spark.local.dir the > spark.local.dir=/data01/tmp > or > spark.local.dir=/data01/tmp,/data02/tmp > > ? > -- Best Regards Jeff Zhang
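For reference, spark.local.dir accepts a comma-separated list, so the question above comes down to listing only the disks that actually exist on a given box; note that on YARN the executors use the NodeManager's local dirs (yarn.nodemanager.local-dirs, exposed as LOCAL_DIRS), which override this setting. A hedged sketch using the paths from the question:

    // spark.local.dir as a comma-separated list of scratch directories.
    // On YARN this is overridden by the NodeManager's LOCAL_DIRS for executors.
    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .set("spark.local.dir", "/data01/tmp,/data02/tmp")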

Re: Control the stdout and stderr streams in a executor JVM

2016-02-28 Thread Jeff Zhang
e.interval > > But is there a possibility to have more fine grained control over these, > like we do in a log4j appender, with a property file? > > Rgds > -- > Niranda > @n1r44 <https://twitter.com/N1R44> > +94-71-554-8430 > https://pythagoreanscript.wordpress.com/ > -- Best Regards Jeff Zhang
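One common approach to the fine-grained control asked about here (not necessarily what the thread settled on; the file names are assumptions) is to ship a custom log4j properties file to the executors and point their JVMs at it:

    // Distribute a log4j config to every executor's working directory and load it,
    // which gives appender-level control over what reaches stdout/stderr.
    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .set("spark.files", "/local/path/log4j-executor.properties")
      .set("spark.executor.extraJavaOptions",
           "-Dlog4j.configuration=file:log4j-executor.properties")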

Re: Is spark.driver.maxResultSize used correctly ?

2016-02-28 Thread Jeff Zhang
have skew and almost all the result data are in > one or a few tasks though. > > > On Friday, February 26, 2016, Jeff Zhang <zjf...@gmail.com> wrote: > >> >> My job get this exception very easily even when I set large value of >> spark.driver.maxRe

Is spark.driver.maxResultSize used correctly ?

2016-02-26 Thread Jeff Zhang
(1085.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) -- Best Regards Jeff Zhang
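For context on the quoted error: spark.driver.maxResultSize caps the total size of serialized task results sent back to the driver. A hedged sketch of raising it (the value is an assumption, and the driver still needs enough memory to hold the collected data):

    // Raise the cap on serialized results returned to the driver.
    // "0" disables the limit entirely, at the risk of a driver OOM.
    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .set("spark.driver.maxResultSize", "2g")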

Re: ORC file writing hangs in pyspark

2016-02-23 Thread Jeff Zhang
aged/raw_result, is created with a _temporary > folder, but the data is never written. The job hangs at this point, > apparently indefinitely. > > Additionally, no logs are recorded or available for the jobs on the > history server. > > What could be the problem? > -- Best Regards Jeff Zhang
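For reference, a minimal ORC write in Scala (the PySpark call is analogous; the output path and data are assumptions). A lone _temporary directory under the output path means the write tasks never committed, so the executor logs for that stage are the first place to look.

    // Minimal sketch of a DataFrame ORC write, assuming a spark-shell style
    // session where sc and sqlContext are already available
    // (in Spark 1.x the ORC data source requires the Hive build / HiveContext).
    import sqlContext.implicits._
    val df = sc.parallelize(1 to 1000).toDF("id")
    df.write.mode("overwrite").format("orc").save("/tmp/staged/raw_result")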

Re: Are we running SparkR tests in Jenkins?

2016-01-15 Thread Jeff Zhang
Created https://issues.apache.org/jira/browse/SPARK-12846 On Fri, Jan 15, 2016 at 3:29 PM, Jeff Zhang <zjf...@gmail.com> wrote: > Right, I forget the documentation, will create a follow up jira. > > On Fri, Jan 15, 2016 at 3:23 PM, Shivaram Venkataraman < > shiva...@eec

Re: Are we running SparkR tests in Jenkins?

2016-01-15 Thread Jeff Zhang
gt; >>> Running R applications through 'sparkR' is not supported as of Spark > 2.0. > >>> Use ./bin/spark-submit > >> > >> > >> Are we still running R tests? Or just saying that this will be > deprecated? > >> > >> Kind regards, > >> > >> Herman van Hövell tot Westerflier > >> > > > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > > -- Best Regards Jeff Zhang

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Jeff Zhang
월 5일 (화) 오후 2:27, Julio Antonio Soto de Vicente < >>>>>>>>>>> ju...@esbet.es>님이 작성: >>>>>>>>>>> >>>>>>>>>>>> Unfortunately, Koert is right. >>>>>>>>>>>> >>>>>>>>>>>> I've been in a couple of projects using Spark (banking >>>>>>>>>>>> industry) where CentOS + Python 2.6 is the toolbox available. >>>>>>>>>>>> >>>>>>>>>>>> That said, I believe it should not be a concern for Spark. >>>>>>>>>>>> Python 2.6 is old and busted, which is totally opposite to the >>>>>>>>>>>> Spark >>>>>>>>>>>> philosophy IMO. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> El 5 ene 2016, a las 20:07, Koert Kuipers <ko...@tresata.com> >>>>>>>>>>>> escribió: >>>>>>>>>>>> >>>>>>>>>>>> rhel/centos 6 ships with python 2.6, doesnt it? >>>>>>>>>>>> >>>>>>>>>>>> if so, i still know plenty of large companies where python 2.6 >>>>>>>>>>>> is the only option. asking them for python 2.7 is not going to work >>>>>>>>>>>> >>>>>>>>>>>> so i think its a bad idea >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland < >>>>>>>>>>>> juliet.hougl...@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> I don't see a reason Spark 2.0 would need to support Python >>>>>>>>>>>>> 2.6. At this point, Python 3 should be the default that is >>>>>>>>>>>>> encouraged. >>>>>>>>>>>>> Most organizations acknowledge the 2.7 is common, but lagging >>>>>>>>>>>>> behind the version they should theoretically use. Dropping python >>>>>>>>>>>>> 2.6 >>>>>>>>>>>>> support sounds very reasonable to me. >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas < >>>>>>>>>>>>> nicholas.cham...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> +1 >>>>>>>>>>>>>> >>>>>>>>>>>>>> Red Hat supports Python 2.6 on REHL 5 until 2020 >>>>>>>>>>>>>> <https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/>, >>>>>>>>>>>>>> but otherwise yes, Python 2.6 is ancient history and the core >>>>>>>>>>>>>> Python >>>>>>>>>>>>>> developers stopped supporting it in 2013. REHL 5 is not a good >>>>>>>>>>>>>> enough >>>>>>>>>>>>>> reason to continue support for Python 2.6 IMO. >>>>>>>>>>>>>> >>>>>>>>>>>>>> We should aim to support Python 2.7 and Python 3.3+ (which I >>>>>>>>>>>>>> believe we currently do). >>>>>>>>>>>>>> >>>>>>>>>>>>>> Nick >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang < >>>>>>>>>>>>>> allenzhang...@126.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> plus 1, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> we are currently using python 2.7.2 in production >>>>>>>>>>>>>>> environment. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 在 2016-01-05 18:11:45,"Meethu Mathew" < >>>>>>>>>>>>>>> meethu.mat...@flytxt.com> 写道: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> +1 >>>>>>>>>>>>>>> We use Python 2.7 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Meethu Mathew >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin < >>>>>>>>>>>>>>> r...@databricks.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Does anybody here care about us dropping support for Python >>>>>>>>>>>>>>>> 2.6 in Spark 2.0? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Python 2.6 is ancient, and is pretty slow in many aspects >>>>>>>>>>>>>>>> (e.g. json parsing) when compared with Python 2.7. Some >>>>>>>>>>>>>>>> libraries that >>>>>>>>>>>>>>>> Spark depend on stopped supporting 2.6. We can still convince >>>>>>>>>>>>>>>> the library >>>>>>>>>>>>>>>> maintainers to support 2.6, but it will be extra work. I'm >>>>>>>>>>>>>>>> curious if >>>>>>>>>>>>>>>> anybody still uses Python 2.6 to run Spark. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>> >>>> >>> >>> >> > -- Best Regards Jeff Zhang

How to execute non-hadoop command ?

2016-01-04 Thread Jeff Zhang
but don't find how it associates with hadoop -- Best Regards Jeff Zhang

Re: How to execute non-hadoop command ?

2016-01-04 Thread Jeff Zhang
Sorry, wrong list. On Tue, Jan 5, 2016 at 12:36 PM, Jeff Zhang <zjf...@gmail.com> wrote: > I want to create a service check for spark, but spark doesn't use the hadoop > script as its launch script. I found other components use ExecuteHadoop to > launch a hadoop job to verify the service,

Re: 答复: How can I get the column data based on specific column name and then stored these data in array or list ?

2015-12-25 Thread Jeff Zhang
e)")) df2.printSchema() df2.show() On Fri, Dec 25, 2015 at 3:44 PM, zml张明磊 <mingleizh...@ctrip.com> wrote: > Thanks, Jeff. It’s not choose some columns of a Row. It’s just choose all > data in a column and convert it to an Array. Do you understand my mean ? > > > > In Chin

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-22 Thread Jeff Zhang
algorithms. >> >> Changes of behavior >> >>- spark.mllib.tree.GradientBoostedTrees validationTol has changed >>semantics in 1.6. Previously, it was a threshold for absolute change in >>error. Now, it resembles the behavior of GradientDescent convergenceTol: >>For large errors, it uses relative error (relative to the previous error); >>for small errors (< 0.01), it uses absolute error. >>- spark.ml.feature.RegexTokenizer: Previously, it did not convert >>strings to lowercase before tokenizing. Now, it converts to lowercase by >>default, with an option not to. This matches the behavior of the simpler >>Tokenizer transformer. >>- Spark SQL's partition discovery has been changed to only discover >>partition directories that are children of the given path. (i.e. if >>path="/my/data/x=1" then x=1 will no longer be considered a partition >>but only children of x=1.) This behavior can be overridden by >>manually specifying the basePath that partitioning discovery should >>start with (SPARK-11678 >><https://issues.apache.org/jira/browse/SPARK-11678>). >>- When casting a value of an integral type to timestamp (e.g. casting >>a long value to timestamp), the value is treated as being in seconds >>instead of milliseconds (SPARK-11724 >><https://issues.apache.org/jira/browse/SPARK-11724>). >>- With the improved query planner for queries having distinct >>aggregations (SPARK-9241 >><https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a >>query having a single distinct aggregation has been changed to a more >>robust version. To switch back to the plan generated by Spark 1.5's >>planner, please set spark.sql.specializeSingleDistinctAggPlanning to >>true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077> >>). >> >> > -- Best Regards Jeff Zhang
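As a small illustration of the partition-discovery change listed above (paths are assumptions, and sqlContext is assumed to come from spark-shell): reading "/my/data/x=1" directly no longer yields x as a partition column unless basePath is supplied.

    // Point partition discovery at the base directory, as described in SPARK-11678.
    val df = sqlContext.read.option("basePath", "/my/data").parquet("/my/data/x=1")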

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-19 Thread Jeff Zhang
a threshold for absolute change in >>error. Now, it resembles the behavior of GradientDescent convergenceTol: >>For large errors, it uses relative error (relative to the previous error); >>for small errors (< 0.01), it uses absolute error. >>- spark.ml.feature.RegexTokenizer: Previously, it did not convert >>strings to lowercase before tokenizing. Now, it converts to lowercase by >>default, with an option not to. This matches the behavior of the simpler >>Tokenizer transformer. >>- Spark SQL's partition discovery has been changed to only discover >>partition directories that are children of the given path. (i.e. if >>path="/my/data/x=1" then x=1 will no longer be considered a partition >>but only children of x=1.) This behavior can be overridden by >>manually specifying the basePath that partitioning discovery should >>start with (SPARK-11678 >><https://issues.apache.org/jira/browse/SPARK-11678>). >>- When casting a value of an integral type to timestamp (e.g. casting >>a long value to timestamp), the value is treated as being in seconds >>instead of milliseconds (SPARK-11724 >><https://issues.apache.org/jira/browse/SPARK-11724>). >>- With the improved query planner for queries having distinct >>aggregations (SPARK-9241 >><https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a >>query having a single distinct aggregation has been changed to a more >>robust version. To switch back to the plan generated by Spark 1.5's >>planner, please set spark.sql.specializeSingleDistinctAggPlanning to >>true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077> >>). >> >> > > > -- > Luciano Resende > http://people.apache.org/~lresende > http://twitter.com/lresende1975 > http://lresende.blogspot.com/ > -- Best Regards Jeff Zhang

Re: [SparkR] Any reason why saveDF's mode is append by default ?

2015-12-14 Thread Jeff Zhang
iginal PR [1]) but the Python API seems to have been > changed to match Scala / Java in > https://issues.apache.org/jira/browse/SPARK-6366 > > Feel free to open a JIRA / PR for this. > > Thanks > Shivaram > > [1] https://github.com/amplab-extras/SparkR-pkg/pull/199/files > >

[SparkR] Any reason why saveDF's mode is append by default ?

2015-12-13 Thread Jeff Zhang
It is inconsistent with the scala api, which uses error mode by default. Any reason for that? Thanks -- Best Regards Jeff Zhang
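The Scala-side behavior referred to here, as a sketch (df and the path are assumptions): DataFrameWriter defaults to the error-if-exists save mode, so append has to be requested explicitly, which is what SparkR's saveDF was doing by default.

    // Scala/Java default vs. the SparkR default being questioned.
    df.write.mode("error").parquet("/tmp/out")   // default: fail if /tmp/out already exists
    df.write.mode("append").parquet("/tmp/out")  // what saveDF defaulted to in SparkR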

Re: Spark doesn't unset HADOOP_CONF_DIR when testing ?

2015-12-06 Thread Jeff Zhang
Thanks Josh, created https://issues.apache.org/jira/browse/SPARK-12166 On Mon, Dec 7, 2015 at 4:32 AM, Josh Rosen <joshro...@databricks.com> wrote: > I agree that we should unset this in our tests. Want to file a JIRA and > submit a PR to do this? > > On Thu, Dec 3, 2015 at

Spark doesn't unset HADOOP_CONF_DIR when testing ?

2015-12-03 Thread Jeff Zhang
) [info] at org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171) [info] at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162) [info] at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) -- Best Regards Jeff

Re: Problem in running MLlib SVM

2015-11-28 Thread Jeff Zhang
;Integer, Integer, > Integer>() > { > public Integer call(Integer arg0, Integer arg1) > throws Exception { > return arg0+arg1; > }}); > > //compute accuracy as the percentage of the correctly classified > examples > double accuracy=((double)sum)/((double)classification.count()); > System.out.println("Accuracy = " + accuracy); > > } > } > ); > } > } > -- Best Regards Jeff Zhang

Re: Problem in running MLlib SVM

2015-11-28 Thread Jeff Zhang
ore is more than 0 and the > label is positive, then I return 1 which is correct classification and I > return zero otherwise. Do you have any idea how to classify a point as > positive or negative using this score or another function ? > > On Sat, Nov 28, 2015 at 5:14 AM, Jeff Zhang <
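A hedged MLlib sketch of the scoring question in this thread (assumes spark-shell with sc defined; the LIBSVM file path is an assumption): with the default threshold of 0.0, predict() already returns the 0/1 class, while clearing the threshold returns the raw margin so a custom cutoff can be applied; the sign of the margin is what decides the class.

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.util.MLUtils
    val data = MLUtils.loadLibSVMFile(sc, "/tmp/sample_libsvm_data.txt")
    val model = SVMWithSGD.train(data, 100)                    // 100 iterations
    val classes = data.map(p => model.predict(p.features))     // 0.0 or 1.0 (thresholded at 0.0)
    model.clearThreshold()                                     // predict() now returns raw scores
    val margins = data.map(p => model.predict(p.features))     // margin > 0 => positive class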

Re: FW: SequenceFile and object reuse

2015-11-18 Thread Jeff Zhang
as suggested by the above? What format > did you use? > > thanks > Jeff > -- Best Regards Jeff Zhang

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-17 Thread Jeff Zhang
Sure, hive profile is enabled. On Wed, Nov 18, 2015 at 6:12 AM, Josh Rosen <joshro...@databricks.com> wrote: > Is the Hive profile enabled? I think it may need to be turned on in order > for those JARs to be deployed. > > On Tue, Nov 17, 2015 at 2:27 AM Jeff Zhang <zj

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-17 Thread Jeff Zhang
Created https://issues.apache.org/jira/browse/SPARK-11798 On Wed, Nov 18, 2015 at 9:42 AM, Josh Rosen <joshro...@databricks.com> wrote: > Can you file a JIRA issue to help me triage this further? Thanks! > > On Tue, Nov 17, 2015 at 4:08 PM Jeff Zhang <zjf...@gmail.com> w

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-17 Thread Jeff Zhang
BTW, After I revert SPARK-7841, I can see all the jars under lib_managed/jars On Tue, Nov 17, 2015 at 2:46 PM, Jeff Zhang <zjf...@gmail.com> wrote: > Hi Josh, > > I notice the comments in https://github.com/apache/spark/pull/9575 said > that Datanucleus related jars wi

Re: slightly more informative error message in MLUtils.loadLibSVMFile

2015-11-16 Thread Jeff Zhang
e=\"" + line + "\"" ) >> previous = current >>i += 1 >> } >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> For additional commands, e-mail: dev-h...@spark.apache.org >> >> > -- Best Regards Jeff Zhang

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Jeff Zhang
pull/9575, Spark's build will no > longer place every dependency JAR into lib_managed. Can you say more about > how this affected spark-shell for you (maybe share a stacktrace)? > > On Mon, Nov 16, 2015 at 12:03 AM, Jeff Zhang <zjf...@gmail.com> wrote: > >> >> Someti

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Jeff Zhang
) at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86) On Mon, Nov 16, 2015 at 4:47 PM, Jeff Zhang <zjf...@gmail.com> wrote: > It's about the datanucleus related jars which is needed by

Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Jeff Zhang
Sometimes, the jars under lib_managed are missing. And after I rebuild spark, the jars under lib_managed are still not downloaded. This causes spark-shell to fail due to missing jars. Has anyone hit this weird issue? -- Best Regards Jeff Zhang

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Jeff Zhang
assembly) println("jars:"+jars.map(_.getAbsolutePath()).mkString(",")) // On Mon, Nov 16, 2015 at 4:51 PM, Jeff Zhang <zjf...@gmail.com> wrote: > This is the exception I got > > 15/11/16 16:50:48 WARN metas

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Jeff Zhang
BTW, After I revert SPARK-7841, I can see all the jars under lib_managed/jars On Tue, Nov 17, 2015 at 2:46 PM, Jeff Zhang <zjf...@gmail.com> wrote: > Hi Josh, > > I notice the comments in https://github.com/apache/spark/pull/9575 said > that Datanucleus related jars wi

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-12 Thread Jeff Zhang
Didn't notice that I can pass comma-separated paths in the existing API (SparkContext#textFile). So there is no need for a new api. Thanks all. On Thu, Nov 12, 2015 at 10:24 AM, Jeff Zhang <zjf...@gmail.com> wrote: > Hi Pradeep > > >>> Looks like what I was suggesting doesn't work. :
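What the thread converged on, as a tiny sketch (file paths are assumptions): SparkContext#textFile already accepts a comma-separated list of paths, and union covers the case where the paths are only known as a collection.

    // Both read the same two files; no new API needed.
    val both = sc.textFile("hdfs:///data/file1.txt,hdfs:///data/file2.txt")
    val viaUnion = sc.union(Seq("hdfs:///data/file1.txt", "hdfs:///data/file2.txt").map(p => sc.textFile(p)))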

Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
that I don't know. -- Best Regards Jeff Zhang

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
implemented a simple patch and it works. On Thu, Nov 12, 2015 at 10:17 AM, Pradeep Gollakota <pradeep...@gmail.com> wrote: > Looks like what I was suggesting doesn't work. :/ > > On Wed, Nov 11, 2015 at 4:49 PM, Jeff Zhang <zjf...@gmail.com> wrote: > >> Yes, that's w

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
n use the RDD#union() (or ++) method to concatenate >> multiple rdds. For example: >> >> val lines1 = sc.textFile("file1") >> val lines2 = sc.textFile("file2") >> >> val rdd = lines1 union lines2 >> >> regards, >> --Jakob >>

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
> list. I haven't tried this, but I think you should just be able to do > sc.textFile("file1,file2,...") > > On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang <zjf...@gmail.com> wrote: > >> I know these workaround, but wouldn't it be more convenient and >> straight

Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-10 Thread Jeff Zhang
on(spark-csv) is not. > Is it necessary for us to modify also spark-csv as you proposed in > SPARK-11622? > > Regards > > Kai > > > On Nov 5, 2015, at 11:30 AM, Jeff Zhang <zjf...@gmail.com> wrote: > > > > > > Not sure the reason, it seems LibSVMRelation and CsvRelatio

Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Jeff Zhang
Not sure of the reason; it seems LibSVMRelation and CsvRelation could extend HadoopFsRelation and leverage the features from HadoopFsRelation. Is there any other consideration for that? -- Best Regards Jeff Zhang

Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Jeff Zhang
tition, which > probably not necessary for LibSVMRelation. > > > > But I think it will be easy to change as extending from HadoopFsRelation. > > > > Hao > > > > *From:* Jeff Zhang [mailto:zjf...@gmail.com] > *Sent:* Thursday, November 5, 2015 10:

Is OutputCommitCoordinator necessary for all the stages ?

2015-08-11 Thread Jeff Zhang
As I understand it, OutputCommitCoordinator should only be necessary for a ResultStage (especially a ResultStage with an hdfs write), but currently it is used for all the stages. Is there any reason for that? -- Best Regards Jeff Zhang

Re: Is OutputCommitCoordinator necessary for all the stages ?

2015-08-11 Thread Jeff Zhang
there still should not be a performance penalty for this because the extra rounds of RPCs should only be performed when necessary. On 8/11/15 2:25 AM, Jeff Zhang wrote: As my understanding, OutputCommitCoordinator should only be necessary for ResultStage (especially for ResultStage with hdfs