Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Ye Xianjin
+1

Sent from my iPhone

On Apr 30, 2024, at 3:23 PM, DB Tsai wrote:
+1

On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote:
To add more color: a Spark data source table and a Hive SerDe table are both stored in the Hive metastore and keep the data files in the table directory. The only difference is that they have different "table providers", which means Spark will use different readers/writers. Ideally the Spark native data source reader/writer is faster than the Hive SerDe ones.

What's more, the default format for Hive SerDe is text. I don't think people want to use text-format tables in production. Most people will add `STORED AS parquet` or `USING parquet` explicitly. By setting this config to false, we get a more reasonable default behavior: creating Parquet tables (or whatever is specified by `spark.sql.sources.default`).

On Tue, Apr 30, 2024 at 10:45 AM Wenchen Fan wrote:
@Mich Talebzadeh there seems to be a misunderstanding here. The Spark native data source table is still stored in the Hive metastore; it's just that Spark will use a different (and faster) reader/writer for it. `hive-site.xml` should work as it does today.

On Tue, Apr 30, 2024 at 5:23 AM Hyukjin Kwon wrote:
+1. It's a legacy conf that we should eventually remove. Spark should create Spark tables by default, not Hive tables. Mich, for your workload, you can simply switch that conf off if it concerns you. We also enabled ANSI as well (which you agreed on). It's a bit awkward to stop in the middle for this compatibility reason while making Spark sound. The compatibility has been tested in production for a long time, so I don't see any particular issue with the compatibility case you mentioned.

On Mon, Apr 29, 2024 at 2:08 AM Mich Talebzadeh wrote:
Hi @Wenchen Fan, thanks for your response. I believe we have not had enough time to "DISCUSS" this matter. Currently, in order to make Spark take advantage of Hive, I create a soft link in $SPARK_HOME/conf. FYI, my Spark version is 3.4.0 and Hive is 3.1.1:

/opt/spark/conf/hive-site.xml -> /data6/hduser/hive-3.1.1/conf/hive-site.xml

This works fine for me in my lab. So in the future, if we opt to set "spark.sql.legacy.createHiveTableByDefault" to false, there will not be a need for this logical link? On the face of it this looks fine, but in real life it may require a number of changes to old scripts. Hence my concern. As a matter of interest, has anyone liaised with the Hive team to ensure they have introduced the additional changes you outlined?

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London, United Kingdom

   view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh

 Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice: "one test result is worth one-thousand expert opinions" (Werner von Braun).

On Sun, 28 Apr 2024 at 09:34, Wenchen Fan wrote:
@Mich Talebzadeh thanks for sharing your concern! Note: creating Spark native data source tables is usually Hive compatible as well, unless we use features that Hive does not support (TIMESTAMP NTZ, ANSI INTERVAL, etc.). I think it's a better default to create a Spark native table in this case, instead of creating a Hive table and failing.

On Sat, Apr 27, 2024 at 12:46 PM Cheng Pan wrote:
+1 (non-binding)

Thanks,
Cheng Pan
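
To illustrate the behavior discussed above, here is a minimal sketch, assuming a Hive-enabled local SparkSession; the table names are made up. The same CREATE TABLE without a provider clause resolves to a Hive text SerDe table when spark.sql.legacy.createHiveTableByDefault is true, and to a table in the spark.sql.sources.default format (Parquet unless overridden) when it is false.

import org.apache.spark.sql.SparkSession

object CreateTableDefaultSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("create-table-default-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // No provider given: the legacy flag decides whether this becomes a
    // Hive text SerDe table (true) or a Spark data source table using
    // spark.sql.sources.default (false).
    spark.sql("CREATE TABLE t_default (id INT, name STRING)")

    // Explicit provider: unaffected by the legacy flag either way.
    spark.sql("CREATE TABLE t_parquet (id INT, name STRING) USING parquet")

    // The Provider row shows which reader/writer Spark will use.
    spark.sql("DESCRIBE TABLE EXTENDED t_default").show(100, truncate = false)

    spark.stop()
  }
}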

On Sat, Apr 27, 2024 at 9:29 AM Holden Karau  wrote:
>
> +1
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh  wrote:
>>
>> +1
>>
>> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun  wrote:
>> >
>> > I'll start with my +1.
>> >
>> > Dongjoon.
>> >
>> > On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
>> > > Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault
>> > > to `false` by default. The technical scope is defined in the following PR.
>> > >
>> > > - DISCUSSION:
>> > > https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
>> > > - JIRA: https://issues.apache.org/jira/browse/SPARK-46122
>> > > - PR: https://github.com/apache/spark/pull/46207
>> > >
>> > > The vote is open until April 30th 1AM (PST) and passes
>> > > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> > >
>> > > [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by default
>> > > [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault because ...
>> > >
>> > > Thank you in advance.
>> > >
>> > > Dongjoon
>> > >
>> >

Re: Welcome two new Apache Spark committers

2023-08-06 Thread Ye Xianjin
Congratulations!

Sent from my iPhone

On Aug 7, 2023, at 11:16 AM, Yuming Wang wrote:
Congratulations!

On Mon, Aug 7, 2023 at 11:11 AM Kent Yao wrote:
Congrats! Peter and Xiduo!

On Mon, Aug 7, 2023 at 11:01, Cheng Pan wrote:
>
> Congratulations! Peter and Xiduo!
>
> Thanks,
> Cheng Pan
>
>
> > On Aug 7, 2023, at 10:58, Gengliang Wang  wrote:
> >
> > Congratulations! Peter and Xiduo!
>
>
>





Re: Random expr in join key not support

2021-10-19 Thread Ye Xianjin
> For that, you can add a table subquery and do it in the select list.

Do you mean something like this:
select * from t1 join (select floor(random()*9) + id as x from t2) m on t1.id = 
m.x ?

Yes, that works. But that raises another question: these two queries seem 
semantically equivalent, yet we treat them differently: one raises an analysis 
exception, the other works well. 
Should we treat them equally?




Sent from my iPhone

> On Oct 20, 2021, at 9:55 AM, Yingyi Bu  wrote:
> 
> 
> Per SQL spec, I think your join query can only be run as a NestedLoopJoin or 
> CartesianProduct.  See page 241 in SQL-99 
> (http://web.cecs.pdx.edu/~len/sql1999.pdf).
> In other words, it might be a correctness bug in other systems if they run 
> your query as a hash join.
> 
> > Here the purpose of adding a random in join key is to resolve the data skew 
> > problem.
> 
> For that, you can add a table subquery and do it in the select list.
> 
> Best,
> Yingyi
> 
> 
>> On Tue, Oct 19, 2021 at 12:46 AM Lantao Jin  wrote:
>> In PostgreSQL and Presto, the below query works well
>> sql> create table t1 (id int);
>> sql> create table t2 (id int);
>> sql> select * from t1 join t2 on t1.id = floor(random() * 9) + t2.id;
>> 
>> But it throws "Error in query: nondeterministic expressions are only allowed 
>> in Project, Filter, Aggregate or Window". Why Spark doesn't support random 
>> expressions in join condition?
>> Here the purpose to add a random in join key is to resolve the data skew 
>> problem.
>> 
>> Thanks,
>> Lantao
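
To make the workaround concrete, here is a small sketch of the subquery approach in the DataFrame API, assuming a SparkSession named spark and tables t1(id) and t2(id) as in the example above. The nondeterministic expression is computed in a projection first, so the join condition itself stays deterministic.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, floor, rand}

object SaltedJoinSketch {
  def run(spark: SparkSession): Unit = {
    val t1 = spark.table("t1")

    // Equivalent to: select floor(random() * 9) + id as x from t2
    val t2Salted = spark.table("t2")
      .withColumn("x", floor(rand() * 9) + col("id"))

    // Join on the precomputed column; the condition itself is deterministic.
    val joined = t1.join(t2Salted, t1("id") === t2Salted("x"))
    joined.show()
  }
}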


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-15 Thread Ye Xianjin
Hi,

Thanks for Ryan and Wenchen for leading this.

I'd like to add my two cents here. In production environments, the function 
catalog might be used by multiple systems, such as Spark, Presto and Hive. Is 
it possible to design this function catalog with a unified function catalog in 
mind, or at least so that it wouldn't be difficult to extend it into a unified 
one?

P.S. We registered a lot of UDFs in the Hive HMS in our production environment, and 
those UDFs are shared by Spark and Presto. It works well, even though it has a 
lot of drawbacks.

Sent from my iPhone

> On Feb 16, 2021, at 2:44 AM, Ryan Blue  wrote:
> 
> 
> Thanks for the positive feedback, everyone. It sounds like there is a clear 
> path forward for calling functions. Even without a prototype, the `invoke` 
> plans show that Wenchen's suggested optimization can be done, and 
> incorporating it as an optional extension to this proposal solves many of the 
> unknowns.
> 
> With that area now understood, is there any discussion about other parts of 
> the proposal, besides the function call interface?
> 
>> On Fri, Feb 12, 2021 at 10:40 PM Chao Sun  wrote:
>> This is an important feature which can unblock several other projects 
>> including bucket join support for DataSource v2, complete support for 
>> enforcing DataSource v2 distribution requirements on the write path, etc. I 
>> like Ryan's proposals which look simple and elegant, with nice support on 
>> function overloading and variadic arguments. On the other hand, I think 
>> Wenchen made a very good point about performance. Overall, I'm excited to 
>> see active discussions on this topic and believe the community will come to 
>> a proposal with the best of both sides.
>> 
>> Chao
>> 
>>> On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon  wrote:
>>> +1 for Liang-chi's.
>>> 
>>> Thanks Ryan and Wenchen for leading this.
>>> 
>>> 
>>> On Sat, Feb 13, 2021 at 12:18 PM, Liang-Chi Hsieh wrote:
 Basically I think the proposal makes sense to me and I'd like to support 
 the
 SPIP as it looks like we have strong need for the important feature.
 
 Thanks Ryan for working on this and I do also look forward to Wenchen's
 implementation. Thanks for the discussion too.
 
 Actually I think the SupportsInvoke proposed by Ryan looks a good
 alternative to me. Besides Wenchen's alternative implementation, is there a
 chance we also have the SupportsInvoke for comparison?
 
 
 John Zhuge wrote
 > Excited to see our Spark community rallying behind this important 
 > feature!
 > 
 > The proposal lays a solid foundation of minimal feature set with careful
 > considerations for future optimizations and extensions. Can't wait to see
 > it leading to more advanced functionalities like views with shared custom
 > functions, function pushdown, lambda, etc. It has already borne fruit 
 > from
 > the constructive collaborations in this thread. Looking forward to
 > Wenchen's prototype and further discussions including the SupportsInvoke
 > extension proposed by Ryan.
 > 
 > 
 > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley 
 
 > owen.omalley@
 
 > 
 > wrote:
 > 
 >> I think this proposal is a very good thing giving Spark a standard way 
 >> of
 >> getting to and calling UDFs.
 >>
 >> I like having the ScalarFunction as the API to call the UDFs. It is
 >> simple, yet covers all of the polymorphic type cases well. I think it
 >> would
 >> also simplify using the functions in other contexts like pushing down
 >> filters into the ORC & Parquet readers although there are a lot of
 >> details
 >> that would need to be considered there.
 >>
 >> .. Owen
 >>
 >>
 >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen 
 
 > ekrogen@.com
 
 > 
 >> wrote:
 >>
 >>> I agree that there is a strong need for a FunctionCatalog within Spark
 >>> to
 >>> provide support for shareable UDFs, as well as make movement towards
 >>> more
 >>> advanced functionality like views which themselves depend on UDFs, so I
 >>> support this SPIP wholeheartedly.
 >>>
 >>> I find both of the proposed UDF APIs to be sufficiently user-friendly
 >>> and
 >>> extensible. I generally think Wenchen's proposal is easier for a user 
 >>> to
 >>> work with in the common case, but has greater potential for confusing
 >>> and
 >>> hard-to-debug behavior due to use of reflective method signature
 >>> searches.
 >>> The merits on both sides can hopefully be more properly examined with
 >>> code,
 >>> so I look forward to seeing an implementation of Wenchen's ideas to
 >>> provide
 >>> a more concrete comparison. I am optimistic that we will not let the
 >>> debate
 >>> over this point unreasonably stall the SPIP from making progress.
 >>>
 >>> Thank 
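
For readers skimming the thread, here is a rough sketch of what a catalog-managed scalar UDF could look like under the API being discussed. The package, interface names, and signatures below are assumptions drawn from the SPIP discussion (the design was still in flux at this point), not a statement of the final committed API; the int_add function is made up.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.functions.{BoundFunction, ScalarFunction, UnboundFunction}
import org.apache.spark.sql.types.{DataType, IntegerType, StructType}

// Unbound entry point that a FunctionCatalog would return by name.
class UnboundIntAdd extends UnboundFunction {
  override def name(): String = "int_add"
  override def description(): String = "int_add(a, b) - adds two integers"
  override def bind(inputType: StructType): BoundFunction = new IntAdd
}

// Bound scalar function: row-at-a-time produceResult, plus an `invoke`
// method in the spirit of the SupportsInvoke / magic-method optimization
// discussed in this thread.
class IntAdd extends ScalarFunction[Integer] {
  override def name(): String = "int_add"
  override def inputTypes(): Array[DataType] = Array(IntegerType, IntegerType)
  override def resultType(): DataType = IntegerType

  override def produceResult(input: InternalRow): Integer =
    input.getInt(0) + input.getInt(1)

  // Called directly (no InternalRow boxing) when the engine can resolve it.
  def invoke(a: Int, b: Int): Int = a + b
}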

Re: Welcoming some new committers and PMC members

2019-09-09 Thread Ye Xianjin
Congratulations!

Sent from my iPhone

> On Sep 10, 2019, at 9:19 AM, Jeff Zhang  wrote:
> 
> Congratulations!
> 
> On Tue, Sep 10, 2019 at 9:16 AM, Saisai Shao wrote:
>> Congratulations!
>> 
>> On Mon, Sep 9, 2019 at 6:11 PM, Jungtaek Lim wrote:
>>> Congratulations! Well deserved!
>>> 
 On Tue, Sep 10, 2019 at 9:51 AM John Zhuge  wrote:
 Congratulations!
 
> On Mon, Sep 9, 2019 at 5:45 PM Shane Knapp  wrote:
> congrats everyone!  :)
> 
> On Mon, Sep 9, 2019 at 5:32 PM Matei Zaharia  
> wrote:
> >
> > Hi all,
> >
> > The Spark PMC recently voted to add several new committers and one PMC 
> > member. Join me in welcoming them to their new roles!
> >
> > New PMC member: Dongjoon Hyun
> >
> > New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming 
> > Wang, Weichen Xu, Ruifeng Zheng
> >
> > The new committers cover lots of important areas including ML, SQL, and 
> > data sources, so it’s great to have them here. All the best,
> >
> > Matei and the Spark PMC
> >
> >
> 
> 
> -- 
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
> 
 
 
 -- 
 John Zhuge
>>> 
>>> 
>>> -- 
>>> Name : Jungtaek Lim
>>> Blog : http://medium.com/@heartsavior
>>> Twitter : http://twitter.com/heartsavior
>>> LinkedIn : http://www.linkedin.com/in/heartsavior
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang


Re: Welcoming three new committers

2015-02-03 Thread Ye Xianjin
Congratulations!

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, February 4, 2015 at 6:34 AM, Matei Zaharia wrote:

 Hi all,
 
 The PMC recently voted to add three new committers: Cheng Lian, Joseph 
 Bradley and Sean Owen. All three have been major contributors to Spark in the 
 past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and many 
 pieces throughout Spark Core. Join me in welcoming them as committers!
 
 Matei
 
 




Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Ye Xianjin
Sean,
the MQTTStreamSuite also failed for me on Mac OS X, though I don't have time 
to investigate it.

--  
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, January 28, 2015 at 9:17 PM, Sean Owen wrote:

 +1 (nonbinding). I verified that all the hash / signing items I
 mentioned before are resolved.
  
 The source package compiles on Ubuntu / Java 8. I ran tests and they
 passed. Well, actually I see the same failure I've been seeing locally on
 OS X and on Ubuntu for a while, but I think nobody else has seen this?
  
 MQTTStreamSuite:
 - mqtt input stream *** FAILED ***
 org.eclipse.paho.client.mqttv3.MqttException: Too many publishes in progress
 at 
 org.eclipse.paho.client.mqttv3.internal.ClientState.send(ClientState.java:423)
  
 Doesn't happen on Jenkins. If nobody else is seeing this, I suspect it
 is something perhaps related to my env that I haven't figured out yet,
 so should not be considered a blocker.
  
 On Wed, Jan 28, 2015 at 10:06 AM, Patrick Wendell pwend...@gmail.com 
 (mailto:pwend...@gmail.com) wrote:
  Please vote on releasing the following candidate as Apache Spark version 
  1.2.1!
   
  The tag to be voted on is v1.2.1-rc1 (commit b77f876):
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
   
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.2.1-rc2/
   
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
   
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1062/
   
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
   
  Changes from rc1:
  This has no code changes from RC1. Only minor changes to the release script.
   
  Please vote on releasing this package as Apache Spark 1.2.1!
   
  The vote is open until Saturday, January 31, at 10:04 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
   
  [ ] +1 Release this package as Apache Spark 1.2.1
  [ ] -1 Do not release this package because ...
   
  For a list of fixes in this release, see http://s.apache.org/Mpn.
   
  To learn more about Apache Spark, please see
  http://spark.apache.org/
   
   
  
  
  
  




Re: spark_classpath in core/pom.xml and yarn/pom.xml

2014-09-25 Thread Ye Xianjin
hi, Sandy Ryza:
 I believe it was you who originally added SPARK_CLASSPATH in core/pom.xml in 
the org.scalatest section. Is this still needed in 1.1?
 I noticed this setting because when I looked into the unit-tests.log, it 
shows the following:
 14/09/24 23:57:19.246 WARN SparkConf:
 SPARK_CLASSPATH was detected (set to 'null').
 This is deprecated in Spark 1.0+.
 
 Please instead use:
  - ./spark-submit with --driver-class-path to augment the driver classpath
  - spark.executor.extraClassPath to augment the executor classpath
 
 14/09/24 23:57:19.246 WARN SparkConf: Setting 'spark.executor.extraClassPath' 
 to 'null' as a work-around.
 14/09/24 23:57:19.247 WARN SparkConf: Setting 'spark.driver.extraClassPath' 
 to 'null' as a work-around.

However, I didn't set the SPARK_CLASSPATH env variable. And looking into 
SparkConf.scala, if the user actually sets extraClassPath, SparkConf will throw a 
SparkException.
-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, September 23, 2014 at 12:56 AM, Ye Xianjin wrote:

 Hi:
 I notice the scalatest-maven-plugin set SPARK_CLASSPATH environment 
 variable for testing. But in the SparkConf.scala, this is deprecated in Spark 
 1.0+.
 So what this variable for? should we just remove this variable?
 
 
 -- 
 Ye Xianjin
 Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
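
As an aside, here is a minimal sketch of the replacement the deprecation warning points at, with a hypothetical jar path. The executor-side classpath can be augmented through spark.executor.extraClassPath in SparkConf, while the driver-side path has to be known before the driver JVM starts, so in practice it goes through spark-submit --driver-class-path or spark-defaults.conf rather than code.

import org.apache.spark.{SparkConf, SparkContext}

object ExtraClassPathSketch {
  def main(args: Array[String]): Unit = {
    val extraJars = "/opt/extra/libs/my-udfs.jar" // hypothetical path

    val conf = new SparkConf()
      .setAppName("extra-classpath-sketch")
      .setMaster("local[*]")
      // Supported replacement for SPARK_CLASSPATH on the executor side.
      .set("spark.executor.extraClassPath", extraJars)
      // spark.driver.extraClassPath is normally supplied via
      // `spark-submit --driver-class-path` instead, since the driver JVM
      // is already running by the time this code executes.

    val sc = new SparkContext(conf)
    try {
      // Classes from the extra jars are visible to tasks on the executors.
      println(sc.parallelize(1 to 3).sum()) // 6.0
    } finally {
      sc.stop()
    }
  }
}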
 



Re: spark_classpath in core/pom.xml and yarn/pom.xml

2014-09-25 Thread Ye Xianjin
Hi Sandy, 

Sorry for the bother. 

The tests run OK even with the SPARK_CLASSPATH setting still there, but it gives a 
config warning and can potentially interfere with other settings, as Marcelo said. 
The warning goes away if I remove it.

And Marcelo, I believe the setting in core/pom should not be used any more. But 
I don't think it's worth filing a JIRA for such a small change. Maybe fold it 
into another related JIRA. It's a pity that your PR
already got merged.
 

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Friday, September 26, 2014 at 6:29 AM, Sandy Ryza wrote:

 Hi Ye,
 
 I think git blame shows me because I fixed the formatting in core/pom.xml, 
 but I don't actually know the original reason for setting SPARK_CLASSPATH 
 there.
 
 Do the tests run OK if you take it out?
 
 -Sandy
 
 
 On Thu, Sep 25, 2014 at 1:59 AM, Ye Xianjin advance...@gmail.com 
 (mailto:advance...@gmail.com) wrote:
  hi, Sandy Ryza:
   I believe It's you originally added the SPARK_CLASSPATH in 
  core/pom.xml in the org.scalatest section. Does this still needed in 1.1?
   I noticed this setting because when I looked into the unit-tests.log, 
  It shows something below:
   14/09/24 23:57:19.246 WARN SparkConf:
   SPARK_CLASSPATH was detected (set to 'null').
   This is deprecated in Spark 1.0+.
  
   Please instead use:
- ./spark-submit with --driver-class-path to augment the driver classpath
- spark.executor.extraClassPath to augment the executor classpath
  
   14/09/24 23:57:19.246 WARN SparkConf: Setting 
   'spark.executor.extraClassPath' to 'null' as a work-around.
   14/09/24 23:57:19.247 WARN SparkConf: Setting 
   'spark.driver.extraClassPath' to 'null' as a work-around.
  
  However I didn't set SPARK_CLASSPATH env variable. And looked into the 
  SparkConf.scala, If user actually set extraClassPath,  the SparkConf will 
  throw SparkException.
  --
  Ye Xianjin
  Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
  
  
  On Tuesday, September 23, 2014 at 12:56 AM, Ye Xianjin wrote:
  
   Hi:
   I notice the scalatest-maven-plugin set SPARK_CLASSPATH environment 
   variable for testing. But in the SparkConf.scala, this is deprecated in 
   Spark 1.0+.
   So what this variable for? should we just remove this variable?
  
  
   --
   Ye Xianjin
   Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
  
  
 



spark_classpath in core/pom.xml and yarn/pom.xml

2014-09-22 Thread Ye Xianjin
Hi:
I notice the scalatest-maven-plugin sets the SPARK_CLASSPATH environment 
variable for testing. But according to SparkConf.scala, this is deprecated in Spark 
1.0+.
So what is this variable for? Should we just remove it?


-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)



Re: about spark assembly jar

2014-09-02 Thread Ye Xianjin
Sorry, the quick reply didn't cc the dev list.

Sean, sometimes I have to use the spark-shell to confirm some behavior change. 
In that case, I have to reassemble the whole project. Is there another way 
around this, without using the big jar in development? For the original question, I 
have no comments. 

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, September 2, 2014 at 4:58 PM, Sean Owen wrote:

 No, usually you unit-test your changes during development. That
 doesn't require the assembly. Eventually you may wish to test some
 change against the complete assembly.
 
 But that's a different question; I thought you were suggesting that
 the assembly JAR should never be created.
 
 On Tue, Sep 2, 2014 at 9:53 AM, Ye Xianjin advance...@gmail.com 
 (mailto:advance...@gmail.com) wrote:
  Hi, Sean:
  In development, do I really need to reassemble the whole project even if I
  only change a line or two of code in one component?
  I used to do that but found it time-consuming.
  
  --
  Ye Xianjin
  Sent with Sparrow
  
  On Tuesday, September 2, 2014 at 4:45 PM, Sean Owen wrote:
  
  Hm, are you suggesting that the Spark distribution be a bag of 100
  JARs? It doesn't quite seem reasonable. It does not remove version
  conflicts, just pushes them to run-time, which isn't good. The
  assembly is also necessary because that's where shading happens. In
  development, you want to run against exactly what will be used in a
  real Spark distro.
  
  On Tue, Sep 2, 2014 at 9:39 AM, scwf wangf...@huawei.com 
  (mailto:wangf...@huawei.com) wrote:
  
  hi, all
  I suggest Spark not use the assembly jar as the default run-time
  dependency (spark-submit/spark-class depend on the assembly jar); using a library of
  all 3rd-party dependency jars, as hadoop/hive/hbase do, is more reasonable.
  
  1. The assembly jar packages all 3rd-party jars into one big jar, so we need to rebuild
  it if we want to update the version of some component (such as hadoop).
  2. In our practice with Spark, we sometimes meet jar compatibility issues,
  and it is hard to diagnose compatibility issues with an assembly jar.
  
  
  
  
  
  
  
  
  
  
 
 
 




Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Ye Xianjin
We just use CDH 4.7 for our production cluster, and I believe we won't move to CDH 
5 in the next year.

Sent from my iPhone

 On Aug 29, 2014, at 14:39, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 Personally I'd actually consider putting CDH4 back if there are still users 
 on it. It's always better to be inclusive, and the convenience of a one-click 
 download is high. Do we have a sense on what % of CDH users still use CDH4?
 
 Matei
 
 On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com) wrote:
 
 (Copying my reply since I don't know if it goes to the mailing list) 
 
 Great, thanks for explaining the reasoning. You're saying these aren't 
 going into the final release? I think that moots any issue surrounding 
 distributing them then. 
 
 This is all I know of from the ASF: 
 https://community.apache.org/projectIndependence.html I don't read it 
 as expressly forbidding this kind of thing although you can see how it 
 bumps up against the spirit. There's not a bright line -- what about 
 Tomcat providing binaries compiled for Windows for example? does that 
 favor an OS vendor? 
 
 From this technical ASF perspective only the releases matter -- do 
 what you want with snapshots and RCs. The only issue there is maybe 
 releasing something different than was in the RC; is that at all 
 confusing? Just needs a note. 
 
 I think this theoretical issue doesn't exist if these binaries aren't 
 released, so I see no reason to not proceed. 
 
 The rest is a different question about whether you want to spend time 
 maintaining this profile and candidate. The vendor already manages 
 their build I think and -- and I don't know -- may even prefer not to 
 have a different special build floating around. There's also the 
 theoretical argument that this turns off other vendors from adopting 
 Spark if it's perceived to be too connected to other vendors. I'd like 
 to maximize Spark's distribution and there's some argument you do this 
 by not making vendor profiles. But as I say a different question to 
 just think about over time... 
 
 (oh and PS for my part I think it's a good thing that CDH4 binaries 
 were removed. I wasn't arguing for resurrecting them) 
 
 On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell pwend...@gmail.com wrote: 
 Hey Sean, 
 
 The reason there are no longer CDH-specific builds is that all newer 
 versions of CDH and HDP work with builds for the upstream Hadoop 
 projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and 
 the Hadoop-without-Hive (also 2.4) build. 
 
 For MapR - we can't officially post those artifacts on ASF web space 
 when we make the final release, we can only link to them as being 
 hosted by MapR specifically since they use non-compatible licenses. 
 However, I felt that providing these during a testing period was 
 alright, with the goal of increasing test coverage. I couldn't find 
 any policy against posting these on personal web space during RC 
 voting. However, we can remove them if there is one. 
 
 Dropping CDH4 was more because it is now pretty old, but we can add it 
 back if people want. The binary packaging is a slightly separate 
 question from release votes, so I can always add more binary packages 
 whenever. And on this, my main concern is covering the most popular 
 Hadoop versions to lower the bar for users to build and test Spark. 
 
 - Patrick 
 
 On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com wrote: 
 +1 I tested the source and Hadoop 2.4 release. Checksums and 
 signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't 
 fail any more than usual. 
 
 FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another 
 project and have encountered no problems. 
 
 
 I notice that the 1.1.0 release removes the CDH4-specific build, but 
 adds two MapR-specific builds. Compare with 
 https://dist.apache.org/repos/dist/release/spark/spark-1.0.2/ I 
 commented on the commit: 
 https://github.com/apache/spark/commit/ceb19830b88486faa87ff41e18d03ede713a73cc
  
 
 I'm in favor of removing all vendor-specific builds. This change 
 *looks* a bit funny as there was no JIRA (?) and appears to swap one 
 vendor for another. Of course there's nothing untoward going on, but 
 what was the reasoning? It's best avoided, and MapR already 
 distributes Spark just fine, no? 
 
 This is a gray area with ASF projects. I mention it as well because it 
 came up with Apache Flink recently 
 (http://mail-archives.eu.apache.org/mod_mbox/incubator-flink-dev/201408.mbox/%3CCANC1h_u%3DN0YKFu3pDaEVYz5ZcQtjQnXEjQA2ReKmoS%2Bye7%3Do%3DA%40mail.gmail.com%3E)
 Another vendor rightly noted this could look like favoritism. They 
 changed to remove vendor releases. 
 
 On Fri, Aug 29, 2014 at 3:14 AM, Patrick Wendell pwend...@gmail.com 
 wrote: 
 Please vote on releasing the following candidate as Apache Spark version 
 1.1.0! 
 
 The tag to be voted on is v1.1.0-rc2 (commit 711aebb3): 
 

Re: Tests failed after assembling the latest code from github

2014-04-14 Thread Ye Xianjin
Thank you for your reply. 

After building the assembly jar, the repl test still failed. The error output 
is the same as I posted before. 

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, April 15, 2014 at 1:39 AM, Michael Armbrust wrote:

 I believe you may need an assembly jar to run the ReplSuite. sbt/sbt
 assembly/assembly.
 
 Michael
 
 
 On Mon, Apr 14, 2014 at 3:14 AM, Ye Xianjin advance...@gmail.com 
 (mailto:advance...@gmail.com) wrote:
 
  Hi, everyone:
  I am new to Spark development. I download spark's latest code from github.
  After running sbt/sbt assembly,
  I began running sbt/sbt test in the spark source code dir. But it failed
  running the repl module test.
  
  Here are some output details.
  
  command:
  sbt/sbt test-only org.apache.spark.repl.*
  output:
  
  [info] Loading project definition from
  /Volumes/MacintoshHD/github/spark/project/project
  [info] Loading project definition from
  /Volumes/MacintoshHD/github/spark/project
  [info] Set current project to root (in build
  file:/Volumes/MacintoshHD/github/spark/)
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for graphx/test:testOnly
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for bagel/test:testOnly
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for streaming/test:testOnly
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for mllib/test:testOnly
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for catalyst/test:testOnly
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for core/test:testOnly
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for assembly/test:testOnly
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for sql/test:testOnly
  [info] ExecutorClassLoaderSuite:
  2014-04-14 16:59:31.247 java[8393:1003] Unable to load realm info from
  SCDynamicStore
  [info] - child first *** FAILED *** (440 milliseconds)
  [info] java.lang.ClassNotFoundException: ReplFakeClass2
  [info] at java.lang.ClassLoader.findClass(ClassLoader.java:364)
  [info] at
  org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
  [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
  [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
  [info] at
  org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
  [info] at
  org.apache.spark.repl.ExecutorClassLoader$$anonfun$findClass$1.apply(ExecutorClassLoader.scala:57)
  [info] at
  org.apache.spark.repl.ExecutorClassLoader$$anonfun$findClass$1.apply(ExecutorClassLoader.scala:57)
  [info] at scala.Option.getOrElse(Option.scala:120)
  [info] at
  org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:57)
  [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
  [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
  [info] at
  org.apache.spark.repl.ExecutorClassLoaderSuite$$anonfun$1.apply$mcV$sp(ExecutorClassLoaderSuite.scala:47)
  [info] at
  org.apache.spark.repl.ExecutorClassLoaderSuite$$anonfun$1.apply(ExecutorClassLoaderSuite.scala:44)
  [info] at
  org.apache.spark.repl.ExecutorClassLoaderSuite$$anonfun$1.apply(ExecutorClassLoaderSuite.scala:44)
  [info] at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265)
  [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1974)
  [info] at
  org.apache.spark.repl.ExecutorClassLoaderSuite.withFixture(ExecutorClassLoaderSuite.scala:30)
  [info] at
  org.scalatest.FunSuite$class.invokeWithFixture$1(FunSuite.scala:1262)
  [info] at
  org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271)
  [info] at
  org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271)
  [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:198)
  [info] at org.scalatest.FunSuite$class.runTest(FunSuite.scala:1271)
  [info] at
  org.apache.spark.repl.ExecutorClassLoaderSuite.runTest(ExecutorClassLoaderSuite.scala:30)
  [info] at
  org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304)
  [info] at
  org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304)
  [info] at
  org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:260)
  [info] at
  org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:249)
  [info] at scala.collection.immutable.List.foreach(List.scala:318)
  [info] at org.scalatest.SuperEngine.org 
  (http://org.scalatest.SuperEngine.org)
  $scalatest$SuperEngine$$runTestsInBranch(Engine.scala:249)
  [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:326)
  [info] at org.scalatest.FunSuite$class.runTests(FunSuite.scala:1304)
  [info] at
  org.apache.spark.repl.ExecutorClassLoaderSuite.runTests

Re: Tests failed after assembling the latest code from github

2014-04-14 Thread Ye Xianjin
Hi, I think I have found the cause of the failing tests. 

I have two disks on my laptop. The spark project dir is on an HDD, while 
the temp dir created by google.io.Files.createTempDir is under 
/var/folders/5q/, which is on the system disk, an SSD.
The ExecutorClassLoaderSuite test uses the 
org.apache.spark.TestUtils.createCompiledClass method.
The createCompiledClass method first generates the compiled class in the 
pwd (spark/repl), then uses renameTo to move
the file. The renameTo call fails because the dest file is on a different 
filesystem than the source file.

I modified TestUtils.scala to first copy the file to the dest and then delete the 
original file. The tests now run smoothly.
Should I file a JIRA about this problem? Then I can send a PR on GitHub.

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
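
Here is a hedged sketch of the copy-then-delete workaround described above; it is not the actual TestUtils.scala patch, and the helper name is made up. File.renameTo fails when source and destination sit on different filesystems, so the fallback copies the file and removes the original.

import java.io.File
import java.nio.file.{Files, StandardCopyOption}

object CrossFilesystemMove {
  // Move src to dest, falling back to copy + delete when renameTo fails,
  // e.g. because the two paths live on different volumes/filesystems.
  def moveFile(src: File, dest: File): Unit = {
    if (!src.renameTo(dest)) {
      Files.copy(src.toPath, dest.toPath, StandardCopyOption.REPLACE_EXISTING)
      if (!src.delete()) {
        throw new java.io.IOException(s"Could not delete original file ${src.getAbsolutePath}")
      }
    }
  }
}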


On Tuesday, April 15, 2014 at 3:43 AM, Ye Xianjin wrote:

 well. This is very strange. 
 I looked into ExecutorClassLoaderSuite.scala and ReplSuite.scala and made 
 small changes to ExecutorClassLoaderSuite.scala (mostly output some internal 
 variables). After that, when running repl test, I noticed the ReplSuite  
 was tested first and the test result is ok. But the ExecutorClassLoaderSuite 
 test was weird.
 Here is the output:
 [info] ExecutorClassLoaderSuite:
 [error] Uncaught exception when running 
 org.apache.spark.repl.ExecutorClassLoaderSuite: java.lang.OutOfMemoryError: 
 PermGen space
 [error] Uncaught exception when running 
 org.apache.spark.repl.ExecutorClassLoaderSuite: java.lang.OutOfMemoryError: 
 PermGen space
 Internal error when running tests: java.lang.OutOfMemoryError: PermGen space
 Exception in thread Thread-3 java.io.EOFException
 at 
 java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2577)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
 at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1685)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1323)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
 at sbt.React.react(ForkTests.scala:116)
 at sbt.ForkTests$$anonfun$mainTestTask$1$Acceptor$2$.run(ForkTests.scala:75)
 at java.lang.Thread.run(Thread.java:695)
 
 
 I revert my changes. The test result is same.
 
  I touched the ReplSuite.scala file (use touch command), the test order is 
 reversed, same as the very beginning. And the output is also the same.(The 
 result in my first post).
 
 
 -- 
 Ye Xianjin
 Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
 
 
 On Tuesday, April 15, 2014 at 3:14 AM, Aaron Davidson wrote:
 
  This may have something to do with running the tests on a Mac, as there is
  a lot of File/URI/URL stuff going on in that test which may just have
  happened to work if run on a Linux system (like Jenkins). Note that this
  suite was added relatively recently:
  https://github.com/apache/spark/pull/217
  
  
  On Mon, Apr 14, 2014 at 12:04 PM, Ye Xianjin advance...@gmail.com 
  (mailto:advance...@gmail.com) wrote:
  
   Thank you for your reply.
   
   After building the assembly jar, the repl test still failed. The error
   output is same as I post before.
   
   --
   Ye Xianjin
   Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
   
   
   On Tuesday, April 15, 2014 at 1:39 AM, Michael Armbrust wrote:
   
I believe you may need an assembly jar to run the ReplSuite. sbt/sbt
assembly/assembly.

Michael


On Mon, Apr 14, 2014 at 3:14 AM, Ye Xianjin advance...@gmail.com 
(mailto:advance...@gmail.com)(mailto:
   advance...@gmail.com (mailto:advance...@gmail.com)) wrote:

 Hi, everyone:
 I am new to Spark development. I download spark's latest code from github.
 After running sbt/sbt assembly,
 I began running sbt/sbt test in the spark source code dir. But it failed
 running the repl module test.
 
 Here are some output details.
 
 command:
 sbt/sbt test-only org.apache.spark.repl.*
 output:
 
 [info] Loading project definition from
 /Volumes/MacintoshHD/github/spark/project/project
 [info] Loading project definition from
 /Volumes/MacintoshHD/github/spark/project
 [info] Set current project to root (in build
 file:/Volumes/MacintoshHD/github/spark/)
 [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
 [info] No tests to run for graphx/test:testOnly
 [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
 [info] No tests to run for bagel/test:testOnly
 [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
 [info] No tests to run for streaming/test:testOnly
 [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
 [info] No tests to run for mllib/test:testOnly
 [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
 [info] No tests to run for catalyst/test:testOnly
 [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
 [info] No tests to run