Re: [ANNOUNCE] Welcoming new committers and PMC members

2024-07-24 Thread Ye Xianjin
Congrats all, well done!

Sent from my iPhone

On Jul 24, 2024, at 11:33 PM, Péter Váry wrote:
Congratulations all!

Bryan Keller wrote (on Wed, Jul 24, 2024, 16:21):
Congrats all!

On Jul 24, 2024, at 3:14 AM, Eduard Tudenhöfner wrote:
Congrats everyone, it's amazing to see such great people contributing to and improving the Iceberg community.

On Wed, Jul 24, 2024 at 8:04 AM Honah J. wrote:
Thank you all! Congratulations to Kevin, Piotr, Sung, Xuanwo, and Renjie! It's truly an honor to contribute to the project and witness the incredible growth of our community. I'm excited to continue working together and help drive the project to new heights!

On Tue, Jul 23, 2024 at 7:24 PM Renjie Liu wrote:
Thanks everyone! And congratulations to Piotr, Kevin, Sung, Xuanwo, Honah! It's awesome to work within and see the growth of this community!

On Wed, Jul 24, 2024 at 9:41 AM Manu Zhang wrote:
Congrats everyone! Well done!
Regards,
Manu

On Wed, Jul 24, 2024 at 4:48 AM Piotr Findeisen wrote:
Dear all, thank you for your trust. Very much appreciated. Kevin, Sung, Xuanwo, Honah, Renjie -- congratulations! It's awesome that your efforts were noticed and the value you bring to the table -- recognized.
Best,
Piotr

On Tue, 23 Jul 2024 at 18:56, Steve Zhang wrote:
Congrats everyone!
Thanks,
Steve Zhang


On Jul 23, 2024, at 9:20 AM, Anton Okolnychyi wrote:
Congrats everyone!







Re: [DISCUSS] Differentiate Spark without Spark Connect from Spark Connect

2024-07-20 Thread Ye Xianjin
+1 for Classic

Sent from my iPhone

On Jul 21, 2024, at 10:15 AM, Ruifeng Zheng wrote:
+1 for 'Classic'

On Sun, Jul 21, 2024 at 8:03 AM Xiao Li wrote:
Classic is much better than Legacy. : )

Hyukjin Kwon wrote (on Thu, Jul 18, 2024, 16:58):
Hi all,
I noticed that we need to standardize our terminology before moving forward. For instance, when documenting, 'Spark without Spark Connect' is too long and verbose. Additionally, I've observed that we use various names for Spark without Spark Connect: Spark Classic, Classic Spark, Legacy Spark, etc.
I propose that we consistently refer to it as Spark Classic (vs. Spark Connect).
Please share your thoughts on this. Thanks!




Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Ye Xianjin
+1

Sent from my iPhone

On Apr 30, 2024, at 3:23 PM, DB Tsai wrote:
+1

On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote:
To add more color: Spark data source tables and Hive SerDe tables are both stored in the Hive metastore and keep their data files in the table directory. The only difference is that they have different "table providers", which means Spark will use different readers/writers. Ideally the Spark native data source reader/writer is faster than the Hive SerDe one.
What's more, the default format of Hive SerDe is text. I don't think people want to use text-format tables in production. Most people will add `STORED AS parquet` or `USING parquet` explicitly. By setting this config to false, we have a more reasonable default behavior: creating Parquet tables (or whatever is specified by `spark.sql.sources.default`).

On Tue, Apr 30, 2024 at 10:45 AM Wenchen Fan wrote:
@Mich Talebzadeh there seems to be a misunderstanding here. The Spark native data source table is still stored in the Hive metastore; it's just that Spark will use a different (and faster) reader/writer for it. `hive-site.xml` should work as it does today.

On Tue, Apr 30, 2024 at 5:23 AM Hyukjin Kwon wrote:
+1
It's a legacy conf that we should eventually remove. Spark should create Spark tables by default, not Hive tables.
Mich, for your workload, you can simply switch that conf off if it concerns you. We also enabled ANSI as well (which you agreed on). It's a bit awkward to stop in the middle of making Spark sound for this compatibility reason. The compatibility has been tested in production for a long time, so I don't see any particular issue with the compatibility case you mentioned.

On Mon, Apr 29, 2024 at 2:08 AM Mich Talebzadeh wrote:
Hi @Wenchen Fan,
Thanks for your response. I believe we have not had enough time to "DISCUSS" this matter. Currently, in order to make Spark take advantage of Hive, I create a soft link in $SPARK_HOME/conf. FYI, my Spark version is 3.4.0 and Hive is 3.1.1:
/opt/spark/conf/hive-site.xml -> /data6/hduser/hive-3.1.1/conf/hive-site.xml
This works fine for me in my lab. So in the future, if we opt to set "spark.sql.legacy.createHiveTableByDefault" to false, there will no longer be a need for this logical link? On the face of it, this looks fine, but in real life it may require a number of changes to old scripts. Hence my concern. As a matter of interest, has anyone liaised with the Hive team to ensure they have introduced the additional changes you outlined?
HTH
Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London
United Kingdom

   view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh

 Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed . It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions (Werner Von Braun)".

On Sun, 28 Apr 2024 at 09:34, Wenchen Fan wrote:
@Mich Talebzadeh thanks for sharing your concern!
Note: creating Spark native data source tables is usually Hive-compatible as well, unless we use features that Hive does not support (TIMESTAMP NTZ, ANSI INTERVAL, etc.). I think it's a better default to create a Spark native table in this case, instead of creating a Hive table and failing.

On Sat, Apr 27, 2024 at 12:46 PM Cheng Pan wrote:
+1 (non-binding)

Thanks,
Cheng Pan

On Sat, Apr 27, 2024 at 9:29 AM Holden Karau  wrote:
>
> +1
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh  wrote:
>>
>> +1
>>
>> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun  wrote:
>> >
>> > I'll start with my +1.
>> >
>> > Dongjoon.
>> >
>> > On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
>> > > Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault
>> > > to `false` by default. The technical scope is defined in the following PR.
>> > >
>> > > - DISCUSSION:
>> > > https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
>> > > - JIRA: https://issues.apache.org/jira/browse/SPARK-46122
>> > > - PR: https://github.com/apache/spark/pull/46207
>> > >
>> > > The vote is open until April 30th 1AM (PST) and passes
>> > > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> > >
>> > > [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by default
>> > > [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault because ...
>> > >
>> > > Thank you in advance.
>> > >
>> > > Dongjoon
>> > >
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>> --
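For later readers, here is a minimal sketch of the behavior being voted on, based on Wenchen's description above. It assumes a SparkSession `spark` with Hive support and hypothetical table names; the legacy config itself is assumed to be set in spark-defaults or at session creation.

```scala
// With spark.sql.legacy.createHiveTableByDefault=true (the old default), a CREATE TABLE
// with no USING / STORED AS clause becomes a Hive SerDe table in the default text format.
// With the proposed `false`, it becomes a Spark native table in the format given by
// spark.sql.sources.default (Parquet unless overridden).
spark.sql("CREATE TABLE events_default (id INT, name STRING)")                 // governed by the flag
spark.sql("CREATE TABLE events_native (id INT, name STRING) USING parquet")    // always a Spark native table
spark.sql("CREATE TABLE events_hive (id INT, name STRING) STORED AS parquet")  // always a Hive SerDe table
```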

Re: Welcome two new Apache Spark committers

2023-08-06 Thread Ye Xianjin
Congratulations!

Sent from my iPhone

On Aug 7, 2023, at 11:16 AM, Yuming Wang wrote:
Congratulations!

On Mon, Aug 7, 2023 at 11:11 AM Kent Yao wrote:
Congrats! Peter and Xiduo!

Cheng Pan wrote (on Mon, Aug 7, 2023, 11:01):
>
> Congratulations! Peter and Xiduo!
>
> Thanks,
> Cheng Pan
>
>
> > On Aug 7, 2023, at 10:58, Gengliang Wang  wrote:
> >
> > Congratulations! Peter and Xiduo!
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>





Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-14 Thread Ye Xianjin
The configuration of '…file.upload.path' is wrong. It means a distributed FS path to store your archives/resources/jars temporarily, which are then distributed by Spark to drivers/executors. For your case, you don't need to set this configuration.

Sent from my iPhone

On Feb 14, 2023, at 5:43 AM, karan alang wrote:

Hello All,
I'm trying to run a simple application on GKE (Kubernetes), and it is failing.
Note: I have Spark (bitnami spark chart) installed on GKE using helm install.

Here is what was done:

1. Created a docker image using a Dockerfile:
```
FROM python:3.7-slim
RUN apt-get update && \
    apt-get install -y default-jre && \
    apt-get install -y openjdk-11-jre-headless && \
    apt-get clean
ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64
RUN pip install pyspark
RUN mkdir -p /myexample && chmod 755 /myexample
WORKDIR /myexample
COPY src/StructuredStream-on-gke.py /myexample/StructuredStream-on-gke.py
CMD ["pyspark"]
```

Simple pyspark application:
```
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StructuredStreaming-on-gke").getOrCreate()
data = [('k1', 123000), ('k2', 234000), ('k3', 456000)]
df = spark.createDataFrame(data, ('id', 'salary'))
df.show(5, False)
```

Spark-submit command:
```
spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode cluster --name pyspark-example --conf spark.kubernetes.container.image=pyspark-example:0.1 --conf spark.kubernetes.file.upload.path=/myexample src/StructuredStream-on-gke.py
```

Error I get:
```





23/02/13 13:18:27 INFO KubernetesUtils: Uploading file: /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py to dest: /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a/StructuredStream-on-gke.py...
Exception in thread "main" org.apache.spark.SparkException: Uploading file /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py failed...
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:296)
	at org.apache.spark.deploy.k8s.KubernetesUtils$.renameMainAppResource(KubernetesUtils.scala:270)
	at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configureForPython(DriverCommandFeatureStep.scala:109)
	at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configurePod(DriverCommandFeatureStep.scala:44)
	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:59)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:89)
	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
	at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Error uploading file StructuredStream-on-gke.py
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:319)
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:292)
	... 21 more
Caused by: java.io.IOException: Mkdirs failed to create /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:317)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
	at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2369)
	at org.apache.hadoop.fs.FilterFileSystem.copyFromLocalFile(FilterFileSystem.java:368)
	at org.
```
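To make the advice at the top of this thread concrete, here is a hedged sketch of the two workable setups; the GCS bucket name is a hypothetical placeholder, and the same keys can equally be passed as --conf flags to spark-submit.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("pyspark-example")
  .set("spark.kubernetes.container.image", "pyspark-example:0.1")
  // Option A: point the staging path at a distributed filesystem, not a container-local directory.
  .set("spark.kubernetes.file.upload.path", "gs://my-staging-bucket/spark-upload")  // hypothetical bucket

// Option B: since the script is already baked into the image (see the Dockerfile above),
// skip the upload path entirely and submit it with a local:// URI:
//   local:///myexample/StructuredStream-on-gke.py
```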

Re: [VOTE] Accept Uniffle into the Apache Incubator

2022-05-30 Thread Ye Xianjin
+1 (no-binding)

Sent from my iPhone

> On May 31, 2022, at 10:46 AM, Aloys Zhang  wrote:
> 
> +1 (no-binding)
> 
> XiaoYu  于2022年5月31日周二 10:12写道:
> 
>> +1 (no-binding)
>> 
>> Xun Liu  于2022年5月31日周二 10:07写道:
>>> 
>>> +1 (binding) for me.
>>> 
>>> Good luck!
>>> 
>>> On Tue, May 31, 2022 at 10:04 AM Goson zhang 
>> wrote:
>>> 
 +1 (no-binding)
 
 Good luck!!
 
 tison  于2022年5月31日周二 09:43写道:
 
> +1 (binding)
> 
> Best,
> tison.
> 
> 
> Jerry Shao  于2022年5月31日周二 09:37写道:
> 
>> Hi all,
>> 
>> Following up the [DISCUSS] thread on Uniffle[1] and Firestorm[2], I
 would
>> like to
>> call a VOTE to accept Uniffle into the Apache Incubator, please
>> check
 out
>> the Uniffle Proposal from the incubator wiki[3].
>> 
>> Please cast your vote:
>> 
>> [ ] +1, bring Uniffle into the Incubator
>> [ ] +0, I don't care either way
>> [ ] -1, do not bring Uniffle into the Incubator, because...
>> 
>> The vote will open at least for 72 hours, and only votes from the
>> Incubator PMC are binding, but votes from everyone are welcome.
>> 
>> [1]
>> https://lists.apache.org/thread/fyyhkjvhzl4hpzr52hd64csh5lt2wm6h
>> [2]
>> https://lists.apache.org/thread/y07xjkqzvpchncym9zr1hgm3c4l4ql0f
>> [3]
> 
>> https://cwiki.apache.org/confluence/display/INCUBATOR/UniffleProposal
>> 
>> Best regards,
>> Jerry
>> 
> 
 
>> 
>> -
>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> For additional commands, e-mail: general-h...@incubator.apache.org
>> 
>> 




Re: [DISCUSSION] Incubating Proposal of Uniffle

2022-05-24 Thread Ye Xianjin
+1 (non-binding).

Sent from my iPhone

> On May 25, 2022, at 9:59 AM, Goson zhang  wrote:
> 
> +1 (non-binding)
> 
> Good luck!
> 
> Daniel Widdis  于2022年5月25日周三 09:53写道:
> 
>> +1 (non-binding) from me!  Good luck!
>> 
>> On 5/24/22, 9:05 AM, "Jerry Shao"  wrote:
>> 
>>Hi all,
>> 
>>Due to the name issue in thread (
>>https://lists.apache.org/thread/y07xjkqzvpchncym9zr1hgm3c4l4ql0f), we
>>figured out a new project name "Uniffle" and created a new Thread.
>> Please
>>help to discuss.
>> 
>>We would like to propose Uniffle[1] as a new Apache incubator project,
>> you
>>can find the proposal here [2] for more details.
>> 
>>Uniffle is a high performance, general purpose Remote Shuffle Service
>> for
>>distributed compute engines like Apache Spark
>>, Apache
>>Hadoop MapReduce , Apache Flink
>> and so on. We are aiming to make
>> Firestorm a
>>universal shuffle service for distributed compute engines.
>> 
>>Shuffle is the key part for a distributed compute engine to exchange
>> the
>>data between distributed tasks, the performance and stability of
>> shuffle
>>will directly affect the whole job. Current “local file pull-like
>> shuffle
>>style” has several limitations:
>> 
>>   1. Current shuffle is hard to support super large workloads,
>> especially
>>   in a high load environment, the major problem is IO problem (random
>> disk IO
>>   issue, network congestion and timeout).
>>   2. Current shuffle is hard to deploy on the disaggregated compute
>>   storage environment, as disk capacity is quite limited on compute
>> nodes.
>>   3. The constraint of storing shuffle data locally makes it hard to
>> scale
>>   elastically.
>> 
>>Remote Shuffle Service is the key technology for enterprises to build
>> big
>>data platforms, to expand big data applications to disaggregated,
>>online-offline hybrid environments, and to solve above problems.
>> 
>>The implementation of Remote Shuffle Service -  “Uniffle”  - is heavily
>>adopted in Tencent, and shows its advantages in production. Other
>>enterprises also adopted or prepared to adopt Firestorm in their
>>environments.
>> 
>>Uniffle's key idea is brought from Sailfish shuffle
>><
>> https://www.researchgate.net/publication/262241541_Sailfish_a_framework_for_large_scale_data_processing
>>> ,
>>it has several key design goals:
>> 
>>   1. High performance. Firestorm’s performance is close enough to
>> local
>>   file based shuffle style for small workloads. For large workloads,
>> it is
>>   far better than the current shuffle style.
>>   2. Fault tolerance. Firestorm provides high availability for
>> Coordinated
>>   nodes, and failover for Shuffle nodes.
>>   3. Pluggable. Firestorm is highly pluggable, which could be suited
>> to
>>   different compute engines, different backend storages, and different
>>   wire-protocols.
>> 
>>We believe that Uniffle project will provide the great value for the
>>community if it is accepted by the Apache incubator.
>> 
>>I will help this project as champion and many thanks to the 3 mentors:
>> 
>>   -
>> 
>>   Felix Cheung (felixche...@apache.org)
>>   - Junping du (junping...@apache.org)
>>   - Weiwei Yang (w...@apache.org)
>>   - Xun liu (liu...@apache.org)
>>   - Zhankun Tang (zt...@apache.org)
>> 
>> 
>>[1] https://github.com/Tencent/Firestorm
>>[2]
>> https://cwiki.apache.org/confluence/display/INCUBATOR/UniffleProposal
>> 
>>Best regards,
>>Jerry
>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> For additional commands, e-mail: general-h...@incubator.apache.org
>> 
>> 




Re: Random expr in join key not support

2021-10-19 Thread Ye Xianjin
> For that, you can add a table subquery and do it in the select list.

Do you mean something like this:
select * from t1 join (select floor(random()*9) + id as x from t2) m on t1.id = 
m.x ?

Yes, that works. But that raises another question: these two queries seem 
semantically equivalent, yet we treat them differently: one raises an analysis 
exception, while the other works fine. 
Should we treat them equally?




Sent from my iPhone

> On Oct 20, 2021, at 9:55 AM, Yingyi Bu  wrote:
> 
> 
> Per SQL spec, I think your join query can only be run as a NestedLoopJoin or 
> CartesianProduct.  See page 241 in SQL-99 
> (http://web.cecs.pdx.edu/~len/sql1999.pdf).
> In other words, it might be a correctness bug in other systems if they run 
> your query as a hash join.
> 
> > Here the purpose of adding a random in join key is to resolve the data skew 
> > problem.
> 
> For that, you can add a table subquery and do it in the select list.
> 
> Best,
> Yingyi
> 
> 
>> On Tue, Oct 19, 2021 at 12:46 AM Lantao Jin  wrote:
>> In PostgreSQL and Presto, the below query works well
>> sql> create table t1 (id int);
>> sql> create table t2 (id int);
>> sql> select * from t1 join t2 on t1.id = floor(random() * 9) + t2.id;
>> 
>> But it throws "Error in query: nondeterministic expressions are only allowed 
>> in Project, Filter, Aggregate or Window". Why Spark doesn't support random 
>> expressions in join condition?
>> Here the purpose to add a random in join key is to resolve the data skew 
>> problem.
>> 
>> Thanks,
>> Lantao
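For later readers, here is a self-contained sketch of the two query shapes discussed above, using toy data in local mode; Spark's rand() stands in for the random() of the quoted queries.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("salted-join-sketch").getOrCreate()
import spark.implicits._

Seq(1, 2, 3).toDF("id").createOrReplaceTempView("t1")
Seq(0, 1, 2).toDF("id").createOrReplaceTempView("t2")

// Rejected by the analyzer: a nondeterministic expression directly in the join condition.
// spark.sql("SELECT * FROM t1 JOIN t2 ON t1.id = floor(rand() * 9) + t2.id")

// Accepted: the salted key is materialized in a subquery projection first, so the
// join condition itself only references a column.
spark.sql("""
  SELECT * FROM t1
  JOIN (SELECT floor(rand() * 9) + id AS x FROM t2) m
  ON t1.id = m.x
""").show()
```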


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-15 Thread Ye Xianjin
Hi,

Thanks to Ryan and Wenchen for leading this.

I’d like to add my two cents here. In production environments, the function 
catalog might be used by multiple systems, such as Spark, Presto and Hive. Is 
it possible that this function catalog is designed as a unified function 
catalog from the start, or at least so that it wouldn’t be too difficult to 
extend it into a unified one?

P.S. We registered a lot of UDFs in the Hive HMS in our production environment, and 
those UDFs are shared by Spark and Presto. It works well, even though it has a 
lot of drawbacks.

Sent from my iPhone

> On Feb 16, 2021, at 2:44 AM, Ryan Blue  wrote:
> 
> 
> Thanks for the positive feedback, everyone. It sounds like there is a clear 
> path forward for calling functions. Even without a prototype, the `invoke` 
> plans show that Wenchen's suggested optimization can be done, and 
> incorporating it as an optional extension to this proposal solves many of the 
> unknowns.
> 
> With that area now understood, is there any discussion about other parts of 
> the proposal, besides the function call interface?
> 
>> On Fri, Feb 12, 2021 at 10:40 PM Chao Sun  wrote:
>> This is an important feature which can unblock several other projects 
>> including bucket join support for DataSource v2, complete support for 
>> enforcing DataSource v2 distribution requirements on the write path, etc. I 
>> like Ryan's proposals which look simple and elegant, with nice support on 
>> function overloading and variadic arguments. On the other hand, I think 
>> Wenchen made a very good point about performance. Overall, I'm excited to 
>> see active discussions on this topic and believe the community will come to 
>> a proposal with the best of both sides.
>> 
>> Chao
>> 
>>> On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon  wrote:
>>> +1 for Liang-chi's.
>>> 
>>> Thanks Ryan and Wenchen for leading this.
>>> 
>>> 
>>> 2021년 2월 13일 (토) 오후 12:18, Liang-Chi Hsieh 님이 작성:
 Basically I think the proposal makes sense to me and I'd like to support 
 the
 SPIP as it looks like we have strong need for the important feature.
 
 Thanks Ryan for working on this and I do also look forward to Wenchen's
 implementation. Thanks for the discussion too.
 
 Actually I think the SupportsInvoke proposed by Ryan looks a good
 alternative to me. Besides Wenchen's alternative implementation, is there a
 chance we also have the SupportsInvoke for comparison?
 
 
 John Zhuge wrote
 > Excited to see our Spark community rallying behind this important 
 > feature!
 > 
 > The proposal lays a solid foundation of minimal feature set with careful
 > considerations for future optimizations and extensions. Can't wait to see
 > it leading to more advanced functionalities like views with shared custom
 > functions, function pushdown, lambda, etc. It has already borne fruit 
 > from
 > the constructive collaborations in this thread. Looking forward to
 > Wenchen's prototype and further discussions including the SupportsInvoke
 > extension proposed by Ryan.
 > 
 > 
 > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley <
 
 > owen.omalley@
 
 > >
 > wrote:
 > 
 >> I think this proposal is a very good thing giving Spark a standard way 
 >> of
 >> getting to and calling UDFs.
 >>
 >> I like having the ScalarFunction as the API to call the UDFs. It is
 >> simple, yet covers all of the polymorphic type cases well. I think it
 >> would
 >> also simplify using the functions in other contexts like pushing down
 >> filters into the ORC & Parquet readers although there are a lot of
 >> details
 >> that would need to be considered there.
 >>
 >> .. Owen
 >>
 >>
 >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen <
 
 > ekrogen@.com
 
 > >
 >> wrote:
 >>
 >>> I agree that there is a strong need for a FunctionCatalog within Spark
 >>> to
 >>> provide support for shareable UDFs, as well as make movement towards
 >>> more
 >>> advanced functionality like views which themselves depend on UDFs, so I
 >>> support this SPIP wholeheartedly.
 >>>
 >>> I find both of the proposed UDF APIs to be sufficiently user-friendly
 >>> and
 >>> extensible. I generally think Wenchen's proposal is easier for a user 
 >>> to
 >>> work with in the common case, but has greater potential for confusing
 >>> and
 >>> hard-to-debug behavior due to use of reflective method signature
 >>> searches.
 >>> The merits on both sides can hopefully be more properly examined with
 >>> code,
 >>> so I look forward to seeing an implementation of Wenchen's ideas to
 >>> provide
 >>> a more concrete comparison. I am optimistic that we will not let the
 >>> debate
 >>> over this point unreasonably stall the SPIP from making progress.
 >>>
 >>> Thank
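To make the trade-off in this thread more concrete, here is a rough, non-authoritative sketch of the two calling conventions being compared: the interface-style produceResult path from the SPIP, and a signature-matched "invoke" entry point of the kind Wenchen's optimization (and the SupportsInvoke idea) would call. The trait and method names below are illustrative stand-ins, not the actual proposed Java interfaces.

```scala
import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

// Simplified stand-in for a bound scalar function, as discussed in the SPIP.
trait SketchScalarFunction[R] {
  def name(): String
  def inputTypes(): Array[DataType]
  def resultType(): DataType
  // Generic row-at-a-time path: arguments arrive boxed/untyped.
  def produceResult(args: Seq[Any]): R
}

class StrLen extends SketchScalarFunction[Int] {
  override def name(): String = "strlen"
  override def inputTypes(): Array[DataType] = Array(StringType)
  override def resultType(): DataType = IntegerType
  override def produceResult(args: Seq[Any]): Int =
    args.head.asInstanceOf[String].length

  // Specialized entry point: a method whose JVM signature matches the declared
  // input types, so generated code could call it directly without boxing.
  def invoke(s: String): Int = s.length
}
```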

Re: Welcoming some new committers and PMC members

2019-09-09 Thread Ye Xianjin
Congratulations!

Sent from my iPhone

> On Sep 10, 2019, at 9:19 AM, Jeff Zhang  wrote:
> 
> Congratulations!
> 
> Saisai Shao  于2019年9月10日周二 上午9:16写道:
>> Congratulations!
>> 
>> Jungtaek Lim  于2019年9月9日周一 下午6:11写道:
>>> Congratulations! Well deserved!
>>> 
 On Tue, Sep 10, 2019 at 9:51 AM John Zhuge  wrote:
 Congratulations!
 
> On Mon, Sep 9, 2019 at 5:45 PM Shane Knapp  wrote:
> congrats everyone!  :)
> 
> On Mon, Sep 9, 2019 at 5:32 PM Matei Zaharia  
> wrote:
> >
> > Hi all,
> >
> > The Spark PMC recently voted to add several new committers and one PMC 
> > member. Join me in welcoming them to their new roles!
> >
> > New PMC member: Dongjoon Hyun
> >
> > New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming 
> > Wang, Weichen Xu, Ruifeng Zheng
> >
> > The new committers cover lots of important areas including ML, SQL, and 
> > data sources, so it’s great to have them here. All the best,
> >
> > Matei and the Spark PMC
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> 
> 
> -- 
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
 
 
 -- 
 John Zhuge
>>> 
>>> 
>>> -- 
>>> Name : Jungtaek Lim
>>> Blog : http://medium.com/@heartsavior
>>> Twitter : http://twitter.com/heartsavior
>>> LinkedIn : http://www.linkedin.com/in/heartsavior
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang


Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread Ye Xianjin
Hi AlexG:

Files (blocks, more specifically) have 3 copies on HDFS by default. So 3.8 * 3 = 
11.4 TB.

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote:

> I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2 cluster
> with 16.73 Tb storage, using
> distcp. The dataset is a collection of tar files of about 1.7 Tb each.
> Nothing else was stored in the HDFS, but after completing the download, the
> namenode page says that 11.59 Tb are in use. When I use hdfs du -h -s, I see
> that the dataset only takes up 3.8 Tb as expected. I navigated through the
> entire HDFS hierarchy from /, and don't see where the missing space is. Any
> ideas what is going on and how to rectify it?
> 
> I'm using the spark-ec2 script to launch, with the command
> 
> spark-ec2 -k key -i ~/.ssh/key.pem -s 29 --instance-type=r3.8xlarge
> --placement-group=pcavariants --copy-aws-credentials
> --hadoop-major-version=yarn --spot-price=2.8 --region=us-west-2 launch
> conversioncluster
> 
> and am not modifying any configuration files for Hadoop.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-a-3-8-T-dataset-take-up-11-59-Tb-on-HDFS-tp25471.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> (mailto:user-unsubscr...@spark.apache.org)
> For additional commands, e-mail: user-h...@spark.apache.org 
> (mailto:user-h...@spark.apache.org)
> 
> 




Re: An interesting and serious problem I encountered

2015-02-13 Thread Ye Xianjin
Hi, 

I believe SizeOf.jar may calculate the wrong size for you.
 Spark has a util called SizeEstimator, located at 
org.apache.spark.util.SizeEstimator. And someone extracted it out at 
https://github.com/phatak-dev/java-sizeof/blob/master/src/main/scala/com/madhukaraphatak/sizeof/SizeEstimator.scala
You can try that out in the scala repl. 
The size for Array[Int](43) is 192 bytes (12 bytes object size + 4 bytes length 
variable + (43 * 4 rounded to 176 bytes)).
 And the size for (1, Array[Int](43)) is 240 bytes {
   Tuple2 object: 12 bytes object size + 4 bytes field _1 + 4 bytes field _2 => 
rounded to 24 bytes
   1 => java.lang.Number 12 bytes => rounded to 16 bytes -> java.lang.Integer: 
16 bytes + 4 bytes int => rounded to 24 bytes (Integer extends Number. I thought 
Scala Tuple2 would specialize Int and this would be 4 bytes, but it seems not)
   Array => 192 bytes
}

So, 24 + 24 + 192 = 240 bytes.
This is my calculation based on the spark SizeEstimator. 

However, I am not sure what an Integer occupies on a 64-bit JVM with 
compressed oops on. It should be 12 + 4 = 16 bytes, and if so that would mean the 
SizeEstimator gives the wrong result. @Sean, what do you think?
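For reference, the same accounting written out in one place (these are the thread's SizeEstimator-style estimates for a 64-bit JVM, not measured values):

```scala
val arrayBytes   = 12 + 4 + 43 * 4                           // header + length + 43 ints = 188
val arrayAligned = 192                                       // rounded up to an 8-byte boundary
val boxedInt     = 24                                        // java.lang.Integer, per the estimate above
val tupleShell   = 24                                        // Tuple2 header + two reference fields, rounded
val perPair      = tupleShell + boxedInt + arrayAligned      // 240 bytes per (Int, Array[Int](43)) pair
val totalGiB     = perPair.toLong * 1000000000L / (1L << 30) // ~223 GiB for 1e9 such pairs
```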
-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Friday, February 13, 2015 at 2:26 PM, Landmark wrote:

> Hi foks,
> 
> My Spark cluster has 8 machines, each of which has 377GB physical memory,
> and thus the total maximum memory can be used for Spark is more than
> 2400+GB. In my program, I have to deal with 1 billion of (key, value) pairs,
> where the key is an integer and the value is an integer array with 43
> elements. Therefore, the memory cost of this raw dataset is [(1+43) *
> 10^9 * 4] / (1024 * 1024 * 1024) = 164GB. 
> 
> Since I have to use this dataset repeatedly, I have to cache it in memory.
> Some key parameter settings are: 
> spark.storage.fraction=0.6
> spark.driver.memory=30GB
> spark.executor.memory=310GB.
> 
> But it failed on running a simple countByKey() and the error message is
> "java.lang.OutOfMemoryError: Java heap space...". Does this mean a Spark
> cluster of 2400+GB memory cannot keep 164GB raw data in memory? 
> 
> The codes of my program is as follows:
> 
> def main(args: Array[String]):Unit = {
> val sc = new SparkContext(new SparkConfig());
> 
> val rdd = sc.parallelize(0 until 1000000000, 25600).map(i => (i, new
> Array[Int](43))).cache();
> println("The number of keys is " + rdd.countByKey());
> 
> //some other operations following here ...
> }
> 
> 
> 
> 
> To figure out the issue, I evaluated the memory cost of key-value pairs and
> computed their memory cost using SizeOf.jar. The codes are as follows:
> 
> val arr = new Array[Int](43);
> println(SizeOf.humanReadable(SizeOf.deepSizeOf(arr)));
> 
> val tuple = (1, arr.clone);
> println(SizeOf.humanReadable(SizeOf.deepSizeOf(tuple)));
> 
> The output is:
> 192.0b
> 992.0b
> 
> 
> *Hard to believe, but it is true!! This result means, to store a key-value
> pair, Tuple2 needs more than 5+ times memory than the simplest method with
> array. Even though it may take 5+ times memory, its size is less than
> 1000GB, which is still much less than the total memory size of my cluster,
> i.e., 2400+GB. I really do not understand why this happened.*
> 
> BTW, if the number of pairs is 1 million, it works well. If the arr contains
> only 1 integer, to store a pair, Tuples needs around 10 times memory.
> 
> So I have some questions:
> 1. Why does Spark choose such a poor data structure, Tuple2, for key-value
> pairs? Is there any better data structure for storing (key, value) pairs
> with less memory cost ?
> 2. Given a dataset with size of M, in general Spark how many times of memory
> to handle it?
> 
> 
> Best,
> Landmark
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/An-interesting-and-serious-problem-I-encountered-tp21637.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> (mailto:user-unsubscr...@spark.apache.org)
> For additional commands, e-mail: user-h...@spark.apache.org 
> (mailto:user-h...@spark.apache.org)
> 
> 




Re: Welcoming three new committers

2015-02-03 Thread Ye Xianjin
Congratulations!

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, February 4, 2015 at 6:34 AM, Matei Zaharia wrote:

> Hi all,
> 
> The PMC recently voted to add three new committers: Cheng Lian, Joseph 
> Bradley and Sean Owen. All three have been major contributors to Spark in the 
> past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and many 
> pieces throughout Spark Core. Join me in welcoming them as committers!
> 
> Matei
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> (mailto:dev-unsubscr...@spark.apache.org)
> For additional commands, e-mail: dev-h...@spark.apache.org 
> (mailto:dev-h...@spark.apache.org)
> 
> 




[jira] [Commented] (SPARK-4631) Add real unit test for MQTT

2015-01-29 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296812#comment-14296812
 ] 

Ye Xianjin commented on SPARK-4631:
---

[~dragos], Thread.sleep(50) does pass the test on my machine. 

> Add real unit test for MQTT 
> 
>
> Key: SPARK-4631
> URL: https://issues.apache.org/jira/browse/SPARK-4631
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Critical
> Fix For: 1.3.0
>
>
> A real unit test that actually transfers data to ensure that the MQTTUtil is 
> functional






[jira] [Commented] (SPARK-4631) Add real unit test for MQTT

2015-01-29 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296811#comment-14296811
 ] 

Ye Xianjin commented on SPARK-4631:
---

[~dragos], Thread.sleep(50) does pass the test on my machine. 

> Add real unit test for MQTT 
> 
>
> Key: SPARK-4631
> URL: https://issues.apache.org/jira/browse/SPARK-4631
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Critical
> Fix For: 1.3.0
>
>
> A real unit test that actually transfers data to ensure that the MQTTUtil is 
> functional






[jira] [Issue Comment Deleted] (SPARK-4631) Add real unit test for MQTT

2015-01-29 Thread Ye Xianjin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Xianjin updated SPARK-4631:
--
Comment: was deleted

(was: [~dragos], Thread.sleep(50) do pass the test on my machine. )

> Add real unit test for MQTT 
> 
>
> Key: SPARK-4631
> URL: https://issues.apache.org/jira/browse/SPARK-4631
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Critical
> Fix For: 1.3.0
>
>
> A real unit test that actually transfers data to ensure that the MQTTUtil is 
> functional






[jira] [Commented] (SPARK-4631) Add real unit test for MQTT

2015-01-28 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296232#comment-14296232
 ] 

Ye Xianjin commented on SPARK-4631:
---

Hi [~dragos], I have the same issue here. I'd like to copy the email I sent to 
Sean here, which may help. 

{quote}
Hi Sean:

I enabled the debug flag in log4j. I believe the MQTTStreamSuite failure is 
more likely due to some weird network issue. However, I cannot understand why 
this exception would be thrown.

what I saw in the unit-tests.log is below:
15/01/28 23:41:37.390 ActiveMQ Transport: tcp:///127.0.0.1:53845@23456 DEBUG 
Transport: Transport Connection to: tcp://127.0.0.1:53845 failed: 
java.net.ProtocolException: Invalid CONNECT encoding
java.net.ProtocolException: Invalid CONNECT encoding
at org.fusesource.mqtt.codec.CONNECT.decode(CONNECT.java:77)
at 
org.apache.activemq.transport.mqtt.MQTTProtocolConverter.onMQTTCommand(MQTTProtocolConverter.java:118)
at 
org.apache.activemq.transport.mqtt.MQTTTransportFilter.onCommand(MQTTTransportFilter.java:74)
at 
org.apache.activemq.transport.TransportSupport.doConsume(TransportSupport.java:83)
at 
org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:222)
at 
org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:204)
at java.lang.Thread.run(Thread.java:695)

However when I looked at the code 
http://grepcode.com/file/repo1.maven.org/maven2/org.fusesource.mqtt-client/mqtt-client/1.3/org/fusesource/mqtt/codec/CONNECT.java#76
 , I don’t quite understand why that would happen.
I am not familiar with activemq, maybe you can look at this and figure what 
really happened.
{quote}

From a quick look at the paho mqtt-client code, a possible cause for that failure 
is that org.eclipse.paho.mqtt-client doesn't write PROTOCOL_NAME in the MQTT 
frame. But that doesn't quite make sense, as the Jenkins run passes the test, so 
I am not sure.

> Add real unit test for MQTT 
> 
>
> Key: SPARK-4631
> URL: https://issues.apache.org/jira/browse/SPARK-4631
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Critical
> Fix For: 1.3.0
>
>
> A real unit test that actually transfers data to ensure that the MQTTUtil is 
> functional






Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Ye Xianjin
Sean,
the MQTTStreamSuite also fails for me on Mac OS X, though I don't have time 
to investigate that.

--  
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, January 28, 2015 at 9:17 PM, Sean Owen wrote:

> +1 (nonbinding). I verified that all the hash / signing items I
> mentioned before are resolved.
>  
> The source package compiles on Ubuntu / Java 8. I ran tests and the
> passed. Well, actually I see the same failure I've seeing locally on
> OS X and on Ubuntu for a while, but I think nobody else has seen this?
>  
> MQTTStreamSuite:
> - mqtt input stream *** FAILED ***
> org.eclipse.paho.client.mqttv3.MqttException: Too many publishes in progress
> at 
> org.eclipse.paho.client.mqttv3.internal.ClientState.send(ClientState.java:423)
>  
> Doesn't happen on Jenkins. If nobody else is seeing this, I suspect it
> is something perhaps related to my env that I haven't figured out yet,
> so should not be considered a blocker.
>  
> On Wed, Jan 28, 2015 at 10:06 AM, Patrick Wendell  (mailto:pwend...@gmail.com)> wrote:
> > Please vote on releasing the following candidate as Apache Spark version 
> > 1.2.1!
> >  
> > The tag to be voted on is v1.2.1-rc1 (commit b77f876):
> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
> >  
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.1-rc2/
> >  
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >  
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1062/
> >  
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
> >  
> > Changes from rc1:
> > This has no code changes from RC1. Only minor changes to the release script.
> >  
> > Please vote on releasing this package as Apache Spark 1.2.1!
> >  
> > The vote is open until Saturday, January 31, at 10:04 UTC and passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >  
> > [ ] +1 Release this package as Apache Spark 1.2.1
> > [ ] -1 Do not release this package because ...
> >  
> > For a list of fixes in this release, see http://s.apache.org/Mpn.
> >  
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >  
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > (mailto:dev-unsubscr...@spark.apache.org)
> > For additional commands, e-mail: dev-h...@spark.apache.org 
> > (mailto:dev-h...@spark.apache.org)
> >  
>  
>  
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> (mailto:dev-unsubscr...@spark.apache.org)
> For additional commands, e-mail: dev-h...@spark.apache.org 
> (mailto:dev-h...@spark.apache.org)
>  
>  




Re: Can't run Spark java code from command line

2015-01-13 Thread Ye Xianjin
There is no binding issue here. Spark picks the right IP, 10.211.55.3, for you. 
The printed message is just an indication.
 However, I have no idea why spark-shell hangs or stops.

Sent from my iPhone

> On Jan 14, 2015, at 5:10 AM, Akhil Das wrote:
> 
> It just a binding issue with the hostnames in your /etc/hosts file. You can 
> set SPARK_LOCAL_IP and SPARK_MASTER_IP in your conf/spark-env.sh file and 
> restart your cluster. (in that case the spark://myworkstation:7077 will 
> change to the ip address that you provided eg: spark://10.211.55.3).
> 
> Thanks
> Best Regards
> 
>> On Tue, Jan 13, 2015 at 11:15 PM, jeremy p  
>> wrote:
>> Hello all,
>> 
>> I wrote some Java code that uses Spark, but for some reason I can't run it 
>> from the command line.  I am running Spark on a single node (my 
>> workstation). The program stops running after this line is executed :
>> 
>> SparkContext sparkContext = new SparkContext("spark://myworkstation:7077", 
>> "sparkbase");
>> 
>> When that line is executed, this is printed to the screen : 
>> 15/01/12 15:56:19 WARN util.Utils: Your hostname, myworkstation resolves to 
>> a loopback address: 127.0.1.1; using 10.211.55.3 instead (on interface eth0)
>> 15/01/12 15:56:19 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to 
>> another address
>> 15/01/12 15:56:19 INFO spark.SecurityManager: Changing view acls to: 
>> myusername
>> 15/01/12 15:56:19 INFO spark.SecurityManager: Changing modify acls to: 
>> myusername
>> 15/01/12 15:56:19 INFO spark.SecurityManager: SecurityManager: 
>> authentication disabled; ui acls disabled; users with view permissions: 
>> Set(myusername); users with modify permissions: Set(myusername)
>> 
>> After it writes this to the screen, the program stops executing without 
>> reporting an exception.
>> 
>> What's odd is that when I run this code from Eclipse, the same lines are 
>> printed to the screen, but the program keeps executing.
>> 
>> Don't know if it matters, but I'm using the maven assembly plugin, which 
>> includes the dependencies in the JAR.
>> 
>> Here are the versions I'm using :
>> Cloudera : 2.5.0-cdh5.2.1
>> Hadoop : 2.5.0-cdh5.2.1
>> HBase : HBase 0.98.6-cdh5.2.1
>> Java : 1.7.0_65
>> Ubuntu : 14.04.1 LTS
>> Spark : 1.2
> 


[jira] [Commented] (SPARK-5201) ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range

2015-01-11 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273277#comment-14273277
 ] 

Ye Xianjin commented on SPARK-5201:
---

I will send a PR for this.

> ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing 
> with inclusive range
> --
>
> Key: SPARK-5201
> URL: https://issues.apache.org/jira/browse/SPARK-5201
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ye Xianjin
>  Labels: rdd
> Fix For: 1.2.1
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> {code}
>  sc.makeRDD(1 to (Int.MaxValue)).count   // result = 0
>  sc.makeRDD(1 to (Int.MaxValue - 1)).count   // result = 2147483646 = 
> Int.MaxValue - 1
>  sc.makeRDD(1 until (Int.MaxValue)).count// result = 2147483646 = 
> Int.MaxValue - 1
> {code}
> More details on the discussion https://github.com/apache/spark/pull/2874






[jira] [Created] (SPARK-5201) ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range

2015-01-11 Thread Ye Xianjin (JIRA)
Ye Xianjin created SPARK-5201:
-

 Summary: ParallelCollectionRDD.slice(seq, numSlices) has int 
overflow when dealing with inclusive range
 Key: SPARK-5201
 URL: https://issues.apache.org/jira/browse/SPARK-5201
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ye Xianjin
 Fix For: 1.2.1


{code}
 sc.makeRDD(1 to (Int.MaxValue)).count   // result = 0
 sc.makeRDD(1 to (Int.MaxValue - 1)).count   // result = 2147483646 = 
Int.MaxValue - 1
 sc.makeRDD(1 until (Int.MaxValue)).count// result = 2147483646 = 
Int.MaxValue - 1
{code}
More details on the discussion https://github.com/apache/spark/pull/2874
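A minimal illustration of the overflow this issue describes (illustrative only, not the actual ParallelCollectionRDD.slice code): converting the inclusive end into an exclusive bound by adding 1 wraps around, yielding an empty range, which matches the observed count of 0.

{code}
val r = 1 to Int.MaxValue
val wrapped = r.end + 1                   // Int overflow: wraps to Int.MinValue
val asExclusive = Range(r.start, wrapped) // empty, since end < start
println(asExclusive.isEmpty)              // true -> consistent with count == 0
{code}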






Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Ye Xianjin
lLock$$withChannelRetries$1(Locks.scala:78)
at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
at xsbt.boot.Using$.withResource(Using.scala:10)
at xsbt.boot.Using$.apply(Using.scala:9)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
commit c6e0c2ab1c29c184a9302d23ad75e4ccd8060242
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:64)
at sbt.IvySbt.withIvy(Ivy.scala:119)
at sbt.IvySbt.withIvy(Ivy.scala:116)
at sbt.IvySbt$Module.withModule(Ivy.scala:147)
at sbt.IvyActions$.updateEither(IvyActions.scala:156)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1282)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1279)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$84.apply(Defaults.scala:1309)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$84.apply(Defaults.scala:1307)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1312)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1306)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1324)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1264)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1242)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:235)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[error] (streaming-kafka/*:update) sbt.ResolveException: unresolved dependency: 
org.apache.kafka#kafka_2.11;0.8.0: not found
[error] (catalyst/*:update) sbt.ResolveException: unresolved dependency: 
org.scalamacros#quasiquotes_2.11;2.0.1: not found



-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, November 18, 2014 at 3:27 PM, Prashant Sharma wrote:

> It is safe in the sense we would help you with the fix if you run into 
> issues. I have used it, but since I worked on the patch the opinion can be 
> biased. I am using scala 2.11 for day to day development. You should checkout 
> the build instructions here : 
> https://github.com/ScrapCodes/spark-1/blob/patch-3/docs/building-spark.md 
> 
> Prashant Sharma
> 
> 
> 
> On Tue, Nov 18, 2014 at 12:19 PM, Jianshi Huang  (mailto:jianshi.hu...@gmail.com)> wrote:
> > Any notable issues for using Scala 2.11? Is it stable now?
> > 
> > Or can I use Scala 2.11 in my spark application and use Spark dist build 
> > with 2.10 ?
> > 
> > I'm looking forward to migrate to 2.11 for some quasiquote features. 
> > Couldn't make it run in 2.10...
> > 
> > Cheers,-- 
> > Jianshi Huang
> > 
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> 



[jira] [Commented] (FLUME-2385) Flume spans log file with "Spooling Directory Source runner has shutdown" messages at INFO level

2014-11-10 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205891#comment-14205891
 ] 

Ye Xianjin commented on FLUME-2385:
---

hi, [~scaph01], I think (according to my colleague) the more reasonable change 
is to set the log level to debug. 

> Flume spans log file with "Spooling Directory Source runner has shutdown" 
> messages at INFO level
> 
>
> Key: FLUME-2385
> URL: https://issues.apache.org/jira/browse/FLUME-2385
> Project: Flume
>  Issue Type: Improvement
>Affects Versions: v1.4.0
>Reporter: Justin Hayes
>Assignee: Phil Scala
>Priority: Minor
> Fix For: v1.6.0
>
> Attachments: FLUME-2385-0.patch
>
>
> When I start an agent with the following config, the spooling directory 
> source emits "14/05/14 22:36:12 INFO source.SpoolDirectorySource: Spooling 
> Directory Source runner has shutdown." messages twice a second. Pretty 
> innocuous but it will fill up the file system needlessly and get in the way 
> of other INFO messages.
> cis.sources = httpd
> cis.sinks = loggerSink
> cis.channels = mem2logger
> cis.sources.httpd.type = spooldir
> cis.sources.httpd.spoolDir = /var/log/httpd
> cis.sources.httpd.trackerDir = /var/lib/flume-ng/tracker/httpd
> cis.sources.httpd.channels = mem2logger
> cis.sinks.loggerSink.type = logger
> cis.sinks.loggerSink.channel = mem2logger
> cis.channels.mem2logger.type = memory
> cis.channels.mem2logger.capacity = 1
> cis.channels.mem2logger.transactionCapacity = 1000 





[jira] [Commented] (SPARK-4002) JavaKafkaStreamSuite.testKafkaStream fails on OSX

2014-10-22 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179753#comment-14179753
 ] 

Ye Xianjin commented on SPARK-4002:
---

Hi [~rdub], what's your Mac OS X hostname? Mine was advancedxy's-pro; 
notice the illegal ['] there in the hostname. That was causing Kafka to fail, and 
that's what I saw in a Kafka-related test failure a couple of weeks ago. Hopefully 
it's related. 
The details are in the unit-tests.log. So, as [~jerryshao] said, it's better if you 
post your unit test log here so we can get at the real cause.

> JavaKafkaStreamSuite.testKafkaStream fails on OSX
> -
>
> Key: SPARK-4002
> URL: https://issues.apache.org/jira/browse/SPARK-4002
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
> Environment: Mac OSX 10.9.5.
>Reporter: Ryan Williams
>
> [~sowen] mentioned this on spark-dev 
> [here|http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccamassdjs0fmsdc-k-4orgbhbfz2vvrmm0hfyifeeal-spft...@mail.gmail.com%3E]
>  and I just reproduced it on {{master}} 
> ([7e63bb4|https://github.com/apache/spark/commit/7e63bb49c526c3f872619ae14e4b5273f4c535e9]).
> The relevant output I get when running {{./dev/run-tests}} is:
> {code}
> [info] KafkaStreamSuite:
> [info] - Kafka input stream
> [info] Test run started
> [info] Test 
> org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream started
> [error] Test 
> org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream failed: 
> junit.framework.AssertionFailedError: expected:<3> but was:<0>
> [error] at junit.framework.Assert.fail(Assert.java:50)
> [error] at junit.framework.Assert.failNotEquals(Assert.java:287)
> [error] at junit.framework.Assert.assertEquals(Assert.java:67)
> [error] at junit.framework.Assert.assertEquals(Assert.java:199)
> [error] at junit.framework.Assert.assertEquals(Assert.java:205)
> [error] at 
> org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream(JavaKafkaStreamSuite.java:129)
> [error] ...
> [info] Test run finished: 1 failed, 0 ignored, 1 total, 19.798s
> {code}
> Seems like this test should be {{@Ignore}}'d, or some note about this made on 
> the {{README.md}}.






Re: thank you for reviewing our patches

2014-09-26 Thread Ye Xianjin



On Saturday, September 27, 2014 at 4:50 AM, Nicholas Chammas wrote:

>  
> Spark is the first (and currently only) open source project I contribute
> regularly to. My first several PRs against the project, as simple as they
> were, were definitely patches that I “grew up on”

 Yeah, me too. I think I will keep patching.  
> I appreciate the time and effort all the reviewers I’ve interacted with
> have taken to work with me on my PRs, even when they are “trivial”. And I’m
> sure that as I continue to contribute to this project there will be many
> more patches that I will “grow up on”.
>  
> Thank you Patrick, Reynold, Josh, Davies, Michael, and everyone else who’s
> taken time to review one of my patches. I appreciate it!
>  
> Nick
> ​
>  
>  

And yes, the reviewers are very nice and responsive. It's a pleasure 
communicating with them. Thank you for your time.


Re: spark_classpath in core/pom.xml and yarn/pom.xml

2014-09-25 Thread Ye Xianjin
Hi Sandy, 

Sorry to bother you.

The tests run OK even with the SPARK_CLASSPATH setting there now, but it gives a 
config warning and will potentially interfere with other settings, as Marcelo said. 
The warning goes away if I remove it.

And Marcelo, I believe the setting in core/pom should not be used any more. But 
I don't think it's worth filing a JIRA for such a small change; maybe fold it 
into another related JIRA. It's a pity that your PR already got merged.
 

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Friday, September 26, 2014 at 6:29 AM, Sandy Ryza wrote:

> Hi Ye,
> 
> I think git blame shows me because I fixed the formatting in core/pom.xml, 
> but I don't actually know the original reason for setting SPARK_CLASSPATH 
> there.
> 
> Do the tests run OK if you take it out?
> 
> -Sandy
> 
> 
> On Thu, Sep 25, 2014 at 1:59 AM, Ye Xianjin  (mailto:advance...@gmail.com)> wrote:
> > hi, Sandy Ryza:
> >  I believe It's you originally added the SPARK_CLASSPATH in 
> > core/pom.xml in the org.scalatest section. Does this still needed in 1.1?
> >  I noticed this setting because when I looked into the unit-tests.log, 
> > It shows something below:
> > > 14/09/24 23:57:19.246 WARN SparkConf:
> > > SPARK_CLASSPATH was detected (set to 'null').
> > > This is deprecated in Spark 1.0+.
> > >
> > > Please instead use:
> > >  - ./spark-submit with --driver-class-path to augment the driver classpath
> > >  - spark.executor.extraClassPath to augment the executor classpath
> > >
> > > 14/09/24 23:57:19.246 WARN SparkConf: Setting 
> > > 'spark.executor.extraClassPath' to 'null' as a work-around.
> > > 14/09/24 23:57:19.247 WARN SparkConf: Setting 
> > > 'spark.driver.extraClassPath' to 'null' as a work-around.
> > 
> > However I didn't set SPARK_CLASSPATH env variable. And looked into the 
> > SparkConf.scala, If user actually set extraClassPath,  the SparkConf will 
> > throw SparkException.
> > --
> > Ye Xianjin
> > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> > 
> > 
> > On Tuesday, September 23, 2014 at 12:56 AM, Ye Xianjin wrote:
> > 
> > > Hi:
> > > I notice the scalatest-maven-plugin set SPARK_CLASSPATH environment 
> > > variable for testing. But in the SparkConf.scala, this is deprecated in 
> > > Spark 1.0+.
> > > So what this variable for? should we just remove this variable?
> > >
> > >
> > > --
> > > Ye Xianjin
> > > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> > >
> > 
> 



Re: spark_classpath in core/pom.xml and yarn/pom.xml

2014-09-25 Thread Ye Xianjin
Hi Sandy Ryza,
 I believe it was you who originally added SPARK_CLASSPATH in core/pom.xml in 
the org.scalatest section. Is this still needed in 1.1?
 I noticed this setting because when I looked into the unit-tests.log, it 
shows something like the following:
> 14/09/24 23:57:19.246 WARN SparkConf:
> SPARK_CLASSPATH was detected (set to 'null').
> This is deprecated in Spark 1.0+.
> 
> Please instead use:
>  - ./spark-submit with --driver-class-path to augment the driver classpath
>  - spark.executor.extraClassPath to augment the executor classpath
> 
> 14/09/24 23:57:19.246 WARN SparkConf: Setting 'spark.executor.extraClassPath' 
> to 'null' as a work-around.
> 14/09/24 23:57:19.247 WARN SparkConf: Setting 'spark.driver.extraClassPath' 
> to 'null' as a work-around.

However, I didn't set the SPARK_CLASSPATH env variable. And looking into 
SparkConf.scala, if the user actually sets extraClassPath, SparkConf will throw 
a SparkException.
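
For reference, a simplified sketch of the validation behavior described by that warning (a hypothetical helper, not the actual SparkConf source): a detected SPARK_CLASSPATH is copied into the extraClassPath settings as a work-around, and an exception is thrown only when the user set both.

import scala.collection.mutable

// Simplified sketch only, not the real SparkConf code; mirrors the warning above.
def handleDeprecatedSparkClasspath(settings: mutable.Map[String, String]): Unit = {
  sys.env.get("SPARK_CLASSPATH").foreach { cp =>
    println(s"WARN: SPARK_CLASSPATH was detected (set to '$cp'). This is deprecated in Spark 1.0+.")
    Seq("spark.executor.extraClassPath", "spark.driver.extraClassPath").foreach { key =>
      if (settings.contains(key)) {
        // Both the deprecated env variable and the new setting are present: fail loudly.
        throw new IllegalArgumentException(s"Found both SPARK_CLASSPATH and $key; use only $key.")
      } else {
        settings(key) = cp // otherwise fall back to the env variable as a work-around
      }
    }
  }
}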
-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, September 23, 2014 at 12:56 AM, Ye Xianjin wrote:

> Hi:
> I notice the scalatest-maven-plugin set SPARK_CLASSPATH environment 
> variable for testing. But in the SparkConf.scala, this is deprecated in Spark 
> 1.0+.
> So what this variable for? should we just remove this variable?
> 
> 
> -- 
> Ye Xianjin
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> 



spark_classpath in core/pom.xml and yarn/pom.xml

2014-09-22 Thread Ye Xianjin
Hi,
I notice the scalatest-maven-plugin sets the SPARK_CLASSPATH environment 
variable for testing. But in SparkConf.scala, this is deprecated in Spark 1.0+.
So what is this variable for? Should we just remove it?


-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)



java_home detection bug in maven 3.2.3

2014-09-18 Thread Ye Xianjin
Hi, Developers:
  I found this bug today on Mac OS X 10.10. 

  Maven version: 3.2.3
  File path: apache-maven-3.2.3/apache-maven/src/bin/mvn, line 86
  Code snippet:
  
   if [[ -z "$JAVA_HOME" && -x "/usr/libexec/java_home" ]] ; then
 #
 # Apple JDKs
 #
 export JAVA_HOME=/usr/libexec/java_home
   fi
   
   It should be :

   if [[ -z "$JAVA_HOME" && -x "/usr/libexec/java_home" ]] ; then
 #
 # Apple JDKs
 #
 export JAVA_HOME=`/usr/libexec/java_home`
   fi

   I wanted to file a JIRA at http://jira.codehaus.org 
(http://jira.codehaus.org/), but it seems it's not open for registration. So I 
think it's a good idea to send an email here.

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)



Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
Well, that's weird. I don't see this thread in my mailbox as having been sent to the user 
list. Maybe that's because I also subscribe to the incubator mailing list? I do see mails 
sent to the incubator mailing list, and no one replies there. I thought it was because 
people don't subscribe to the incubator list any more.

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Thursday, September 11, 2014 at 12:12 AM, Davies Liu wrote:

> I think the mails to spark.incubator.apache.org 
> (http://spark.incubator.apache.org) will be forwarded to
> spark.apache.org (http://spark.apache.org).
> 
> Here is the header of the first mail:
> 
> from: redocpot mailto:julien19890...@gmail.com)>
> to: u...@spark.incubator.apache.org (mailto:u...@spark.incubator.apache.org)
> date: Mon, Sep 8, 2014 at 7:29 AM
> subject: groupBy gives non deterministic results
> mailing list: user.spark.apache.org (http://user.spark.apache.org) Filter 
> messages from this mailing list
> mailed-by: spark.apache.org (http://spark.apache.org)
> 
> I only subscribe spark.apache.org (http://spark.apache.org), and I do see all 
> the mails from he.
> 
> On Wed, Sep 10, 2014 at 6:29 AM, Ye Xianjin  (mailto:advance...@gmail.com)> wrote:
> > | Do the two mailing lists share messages ?
> > I don't think so. I didn't receive this message from the user list. I am
> > not in databricks, so I can't answer your other questions. Maybe Davies Liu
> > mailto:dav...@databricks.com)> can answer you?
> > 
> > --
> > Ye Xianjin
> > Sent with Sparrow
> > 
> > On Wednesday, September 10, 2014 at 9:05 PM, redocpot wrote:
> > 
> > Hi, Xianjin
> > 
> > I checked user@spark.apache.org (mailto:user@spark.apache.org), and found 
> > my post there:
> > http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/browser
> > 
> > I am using nabble to send this mail, which indicates that the mail will be
> > sent from my email address to the u...@spark.incubator.apache.org 
> > (mailto:u...@spark.incubator.apache.org) mailing
> > list.
> > 
> > Do the two mailing lists share messages ?
> > 
> > Do we have a nabble interface for user@spark.apache.org 
> > (mailto:user@spark.apache.org) mail list ?
> > 
> > Thank you.
> > 
> > 
> > 
> > 
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13876.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com 
> > (http://Nabble.com).
> > 
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> > (mailto:user-unsubscr...@spark.apache.org)
> > For additional commands, e-mail: user-h...@spark.apache.org 
> > (mailto:user-h...@spark.apache.org)
> > 
> 
> 
> 




Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
|  Do the two mailing lists share messages ?
I don't think so.  I didn't receive this message from the user list. I am not 
in databricks, so I can't answer your other questions. Maybe Davies Liu 
 can answer you?

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, September 10, 2014 at 9:05 PM, redocpot wrote:

> Hi, Xianjin
> 
> I checked user@spark.apache.org (mailto:user@spark.apache.org), and found my 
> post there:
> http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/browser
> 
> I am using nabble to send this mail, which indicates that the mail will be
> sent from my email address to the u...@spark.incubator.apache.org 
> (mailto:u...@spark.incubator.apache.org) mailing
> list.
> 
> Do the two mailing lists share messages ?
> 
> Do we have a nabble interface for user@spark.apache.org 
> (mailto:user@spark.apache.org) mail list ?
> 
> Thank you.
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13876.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> (mailto:user-unsubscr...@spark.apache.org)
> For additional commands, e-mail: user-h...@spark.apache.org 
> (mailto:user-h...@spark.apache.org)
> 
> 




Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
Great. And you should ask questions on the user@spark.apache.org mailing list. I 
believe many people don't subscribe to the incubator mailing list any more.

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, September 10, 2014 at 6:03 PM, redocpot wrote:

> Hi, 
> 
> I am using spark 1.0.0. The bug is fixed by 1.0.1.
> 
> Hao
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13864.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> (mailto:user-unsubscr...@spark.apache.org)
> For additional commands, e-mail: user-h...@spark.apache.org 
> (mailto:user-h...@spark.apache.org)
> 
> 




Re: groupBy gives non deterministic results

2014-09-09 Thread Ye Xianjin
Can you provide a small sample or test data that reproduces this problem? And 
what's your environment setup? A single node or a cluster?

Sent from my iPhone

> On 2014年9月8日, at 22:29, redocpot  wrote:
> 
> Hi,
> 
> I have a key-value RDD called rdd below. After a groupBy, I tried to count
> rows.
> But the result is not unique, somehow non deterministic.
> 
> Here is the test code:
> 
>  val step1 = ligneReceipt_cleTable.persist
>  val step2 = step1.groupByKey
> 
>  val s1size = step1.count
>  val s2size = step2.count
> 
>  val t = step2 // rdd after groupBy
> 
>  val t1 = t.count
>  val t2 = t.count
>  val t3 = t.count
>  val t4 = t.count
>  val t5 = t.count
>  val t6 = t.count
>  val t7 = t.count
>  val t8 = t.count
> 
>  println("s1size = " + s1size)
>  println("s2size = " + s2size)
>  println("1 => " + t1)
>  println("2 => " + t2)
>  println("3 => " + t3)
>  println("4 => " + t4)
>  println("5 => " + t5)
>  println("6 => " + t6)
>  println("7 => " + t7)
>  println("8 => " + t8)
> 
> Here are the results:
> 
> s1size = 5338864
> s2size = 5268001
> 1 => 5268002
> 2 => 5268001
> 3 => 5268001
> 4 => 5268002
> 5 => 5268001
> 6 => 5268002
> 7 => 5268002
> 8 => 5268001
> 
> Even if the difference is just one row, that's annoying.  
> 
> Any idea ?
> 
> Thank you.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Ye Xianjin
Well, this means you didn't start a compute cluster, most likely because the 
wrong value of mapreduce.jobtracker.address keeps the slave node from starting 
the node manager. (I am not familiar with the ec2 script, so I don't know 
whether the slave node has a node manager installed or not.) 
Can you check the Hadoop daemon log on the slave node to see whether the 
nodemanager was started but failed, or whether there was no nodemanager to start? 
The log file location defaults to
/var/log/hadoop-xxx if my memory is correct.

Sent from my iPhone

> On 2014年9月9日, at 0:08, Tomer Benyamini  wrote:
> 
> No tasktracker or nodemanager. This is what I see:
> 
> On the master:
> 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
> org.apache.hadoop.hdfs.server.namenode.NameNode
> 
> On the data node (slave):
> 
> org.apache.hadoop.hdfs.server.datanode.DataNode
> 
> 
> 
>> On Mon, Sep 8, 2014 at 6:39 PM, Ye Xianjin  wrote:
>> what did you see in the log? was there anything related to mapreduce?
>> can you log into your hdfs (data) node, use jps to list all java process and
>> confirm whether there is a tasktracker process (or nodemanager) running with
>> datanode process
>> 
>> --
>> Ye Xianjin
>> Sent with Sparrow
>> 
>> On Monday, September 8, 2014 at 11:13 PM, Tomer Benyamini wrote:
>> 
>> Still no luck, even when running stop-all.sh followed by start-all.sh.
>> 
>> On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas
>>  wrote:
>> 
>> Tomer,
>> 
>> Did you try start-all.sh? It worked for me the last time I tried using
>> distcp, and it worked for this guy too.
>> 
>> Nick
>> 
>> 
>> On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini  wrote:
>> 
>> 
>> ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2;
>> 
>> I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and
>> ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error
>> when trying to run distcp:
>> 
>> ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
>> 
>> java.io.IOException: Cannot initialize Cluster. Please check your
>> configuration for mapreduce.framework.name and the correspond server
>> addresses.
>> 
>> at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
>> 
>> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:83)
>> 
>> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:76)
>> 
>> at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
>> 
>> at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
>> 
>> at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
>> 
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>> 
>> at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
>> 
>> Any idea?
>> 
>> Thanks!
>> Tomer
>> 
>> On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen  wrote:
>> 
>> If I recall, you should be able to start Hadoop MapReduce using
>> ~/ephemeral-hdfs/sbin/start-mapred.sh.
>> 
>> On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini 
>> wrote:
>> 
>> 
>> Hi,
>> 
>> I would like to copy log files from s3 to the cluster's
>> ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
>> running on the cluster - I'm getting the exception below.
>> 
>> Is there a way to activate it, or is there a spark alternative to
>> distcp?
>> 
>> Thanks,
>> Tomer
>> 
>> mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
>> org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
>> Invalid "mapreduce.jobtracker.address" configuration value for
>> LocalJobRunner : "XXX:9001"
>> 
>> ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
>> 
>> java.io.IOException: Cannot initialize Cluster. Please check your
>> configuration for mapreduce.framework.name and the correspond server
>> addresses.
>> 
>> at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
>> 
>> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:83)
>> 
>> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:76)
>> 
>> at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
>> 
>> at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
>> 
>> at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
>> 
>> at org.apa

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Ye Xianjin
What did you see in the log? Was there anything related to MapReduce?
Can you log into your HDFS (data) node, use jps to list all Java processes, and 
confirm whether there is a tasktracker process (or nodemanager) running alongside 
the datanode process?


-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Monday, September 8, 2014 at 11:13 PM, Tomer Benyamini wrote:

> Still no luck, even when running stop-all.sh (http://stop-all.sh) followed by 
> start-all.sh (http://start-all.sh).
> 
> On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas
> mailto:nicholas.cham...@gmail.com)> wrote:
> > Tomer,
> > 
> > Did you try start-all.sh (http://start-all.sh)? It worked for me the last 
> > time I tried using
> > distcp, and it worked for this guy too.
> > 
> > Nick
> > 
> > 
> > On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini  > (mailto:tomer@gmail.com)> wrote:
> > > 
> > > ~/ephemeral-hdfs/sbin/start-mapred.sh (http://start-mapred.sh) does not 
> > > exist on spark-1.0.2;
> > > 
> > > I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh 
> > > (http://stop-dfs.sh) and
> > > ~/ephemeral-hdfs/sbin/start-dfs.sh (http://start-dfs.sh), but still 
> > > getting the same error
> > > when trying to run distcp:
> > > 
> > > ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
> > > 
> > > java.io.IOException: Cannot initialize Cluster. Please check your
> > > configuration for mapreduce.framework.name 
> > > (http://mapreduce.framework.name) and the correspond server
> > > addresses.
> > > 
> > > at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
> > > 
> > > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:83)
> > > 
> > > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:76)
> > > 
> > > at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
> > > 
> > > at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
> > > 
> > > at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
> > > 
> > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> > > 
> > > at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
> > > 
> > > Any idea?
> > > 
> > > Thanks!
> > > Tomer
> > > 
> > > On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen  > > (mailto:rosenvi...@gmail.com)> wrote:
> > > > If I recall, you should be able to start Hadoop MapReduce using
> > > > ~/ephemeral-hdfs/sbin/start-mapred.sh (http://start-mapred.sh).
> > > > 
> > > > On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini  > > > (mailto:tomer@gmail.com)>
> > > > wrote:
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > I would like to copy log files from s3 to the cluster's
> > > > > ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
> > > > > running on the cluster - I'm getting the exception below.
> > > > > 
> > > > > Is there a way to activate it, or is there a spark alternative to
> > > > > distcp?
> > > > > 
> > > > > Thanks,
> > > > > Tomer
> > > > > 
> > > > > mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
> > > > > org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
> > > > > Invalid "mapreduce.jobtracker.address" configuration value for
> > > > > LocalJobRunner : "XXX:9001"
> > > > > 
> > > > > ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
> > > > > 
> > > > > java.io.IOException: Cannot initialize Cluster. Please check your
> > > > > configuration for mapreduce.framework.name 
> > > > > (http://mapreduce.framework.name) and the correspond server
> > > > > addresses.
> > > > > 
> > > > > at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
> > > > > 
> > > > > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:83)
> > > > > 
> > > > > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:76)
> > > > > 
> > > > > at 
> > > > > org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
> > > > > 
> > > > > at org.apache.hadoop.tools.Di

Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Ye Xianjin
Distcp requires an MR1 (or MR2) cluster to start. Do you have a MapReduce cluster 
on your HDFS? 
And from the error message, it seems that you didn't specify your jobtracker 
address.
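
For the "Spark alternative to distcp" part of the question: for plain text logs, a rough Spark-side copy like the sketch below works without a MapReduce cluster (the paths and app name are placeholders, the s3n credentials are assumed to be in the Hadoop configuration, and this is not a general distcp replacement).

import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch only: copies text data from S3 to HDFS with Spark itself.
object S3ToHdfsCopy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3-to-hdfs-copy"))
    sc.textFile("s3n://my-bucket/logs/2014-09-*")          // source (placeholder)
      .saveAsTextFile("hdfs:///user/hadoop/logs/2014-09")  // destination (placeholder)
    sc.stop()
  }
}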


-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Sunday, September 7, 2014 at 9:42 PM, Tomer Benyamini wrote:

> Hi,
> 
> I would like to copy log files from s3 to the cluster's
> ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
> running on the cluster - I'm getting the exception below.
> 
> Is there a way to activate it, or is there a spark alternative to distcp?
> 
> Thanks,
> Tomer
> 
> mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
> org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
> Invalid "mapreduce.jobtracker.address" configuration value for
> LocalJobRunner : "XXX:9001"
> 
> ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
> 
> java.io.IOException: Cannot initialize Cluster. Please check your
> configuration for mapreduce.framework.name (http://mapreduce.framework.name) 
> and the correspond server
> addresses.
> 
> at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
> 
> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:83)
> 
> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:76)
> 
> at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
> 
> at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
> 
> at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
> 
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> 
> at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> (mailto:user-unsubscr...@spark.apache.org)
> For additional commands, e-mail: user-h...@spark.apache.org 
> (mailto:user-h...@spark.apache.org)
> 
> 




Re: about spark assembly jar

2014-09-02 Thread Ye Xianjin
Sorry, the quick reply didn't cc the dev list.

Sean, sometimes I have to use the spark-shell to confirm some behavior change. 
In that case, I have to reassemble the whole project. Is there another way 
around that, without using the big jar, in development? For the original question, I 
have no comments.
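
For development iterations, a small local-mode suite like the sketch below (class and test names are made up) can usually be run with sbt's test-only without rebuilding the assembly.

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

// Minimal sketch of a local-mode unit test; the names here are made up.
class WordCountSuite extends FunSuite {
  test("word count in local mode") {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
    try {
      val counts = sc.parallelize(Seq("a b", "b c")).flatMap(_.split(" ")).countByValue()
      assert(counts("b") == 2)
    } finally {
      sc.stop()
    }
  }
}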

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, September 2, 2014 at 4:58 PM, Sean Owen wrote:

> No, usually you unit-test your changes during development. That
> doesn't require the assembly. Eventually you may wish to test some
> change against the complete assembly.
> 
> But that's a different question; I thought you were suggesting that
> the assembly JAR should never be created.
> 
> On Tue, Sep 2, 2014 at 9:53 AM, Ye Xianjin  (mailto:advance...@gmail.com)> wrote:
> > Hi, Sean:
> > In development, do I really need to reassembly the whole project even if I
> > only change a line or two code in one component?
> > I used to that but found time-consuming.
> > 
> > --
> > Ye Xianjin
> > Sent with Sparrow
> > 
> > On Tuesday, September 2, 2014 at 4:45 PM, Sean Owen wrote:
> > 
> > Hm, are you suggesting that the Spark distribution be a bag of 100
> > JARs? It doesn't quite seem reasonable. It does not remove version
> > conflicts, just pushes them to run-time, which isn't good. The
> > assembly is also necessary because that's where shading happens. In
> > development, you want to run against exactly what will be used in a
> > real Spark distro.
> > 
> > On Tue, Sep 2, 2014 at 9:39 AM, scwf  > (mailto:wangf...@huawei.com)> wrote:
> > 
> > hi, all
> > I suggest spark not use assembly jar as default run-time
> > dependency(spark-submit/spark-class depend on assembly jar),use a library of
> > all 3rd dependency jar like hadoop/hive/hbase more reasonable.
> > 
> > 1 assembly jar packaged all 3rd jars into a big one, so we need rebuild
> > this jar if we want to update the version of some component(such as hadoop)
> > 2 in our practice with spark, sometimes we meet jar compatibility issue,
> > it is hard to diagnose compatibility issue with assembly jar
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > (mailto:dev-unsubscr...@spark.apache.org)
> > For additional commands, e-mail: dev-h...@spark.apache.org 
> > (mailto:dev-h...@spark.apache.org)
> > 
> > 
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > (mailto:dev-unsubscr...@spark.apache.org)
> > For additional commands, e-mail: dev-h...@spark.apache.org 
> > (mailto:dev-h...@spark.apache.org)
> > 
> 
> 
> 




[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results

2014-09-01 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117558#comment-14117558
 ] 

Ye Xianjin commented on SPARK-3098:
---

Hi [~srowen] and [~gq], I think what [~matei] wants to say is that because the 
ordering of elements in distinct() is not guaranteed, the result of 
zipWithIndex is not deterministic. If you recompute the RDD with the distinct 
transformation, you are not guaranteed to get the same result. That explains 
the behavior here.

But as [~srowen] said, it's surprising to see different results from the same 
RDD. [~matei], what do you think about this behavior?
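
To illustrate (a sketch based on the reproduce code in the issue description, not part of the original report): materializing the distinct() output before zipWithIndex should keep the indices consistent across actions, as long as the cached partitions stay available.
{code}
// Sketch: caching pins the output of distinct(), so zipWithIndex is evaluated
// against the same ordering on every action instead of a fresh recomputation.
val c = sc.parallelize(1 to 7899).flatMap { i =>
  (1 to 1).toSeq.map(p => i * 6000 + p)
}.distinct().cache().zipWithIndex()
c.join(c).filter(t => t._2._1 != t._2._2).take(3) // expected to be empty with the cache in place
{code}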

>  In some cases, operation zipWithIndex get a wrong results
> --
>
> Key: SPARK-3098
> URL: https://issues.apache.org/jira/browse/SPARK-3098
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.1
>Reporter: Guoqiang Li
>Priority: Critical
>
> The reproduce code:
> {code}
>  val c = sc.parallelize(1 to 7899).flatMap { i =>
>   (1 to 1).toSeq.map(p => i * 6000 + p)
> }.distinct().zipWithIndex() 
> c.join(c).filter(t => t._2._1 != t._2._2).take(3)
> {code}
>  => 
> {code}
>  Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), 
> (36579712,(13,14)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Too many open files

2014-08-29 Thread Ye Xianjin
Oops, the last reply didn't go to the user list. Mail app's fault.

Shuffling happens across the cluster, so you need to change the limit on all the 
nodes in the cluster.
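
If raising the limit is not enough, consolidating shuffle outputs also cuts the number of files dramatically; a quick sketch, assuming a Spark 1.x hash-based shuffle where spark.shuffle.consolidateFiles is available:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: fewer shuffle files per reduce task, in addition to raising ulimit on every node.
val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")
  .set("spark.shuffle.consolidateFiles", "true") // assumption: Spark 1.x hash shuffle
val sc = new SparkContext(conf)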



Sent from my iPhone

> On 2014年8月30日, at 3:10, Sudha Krishna  wrote:
> 
> Hi,
> 
> Thanks for your response. Do you know if I need to change this limit on all 
> the cluster nodes or just the master?
> Thanks
> 
>> On Aug 29, 2014 11:43 AM, "Ye Xianjin"  wrote:
>> 1024 for the number of file limit is most likely too small for Linux 
>> Machines on production. Try to set to 65536 or unlimited if you can. The too 
>> many open files error occurs because there are a lot of shuffle files(if 
>> wrong, please correct me):
>> 
>> Sent from my iPhone
>> 
>> > On 2014年8月30日, at 2:06, SK  wrote:
>> >
>> > Hi,
>> >
>> > I am having the same problem reported by Michael. I am trying to open 30
>> > files. ulimit -n  shows the limit is 1024. So I am not sure why the program
>> > is failing with  "Too many open files" error. The total size of all the 30
>> > files is 230 GB.
>> > I am running the job on a cluster with 10 nodes, each having 16 GB. The
>> > error appears to be happening at the distinct() stage.
>> >
>> > Here is my program. In the following code, are all the 10 nodes trying to
>> > open all of the 30 files or are the files distributed among the 30 nodes?
>> >
>> >val baseFile = "/mapr/mapr_dir/files_2013apr*"
>> >valx = sc.textFile(baseFile)).map { line =>
>> >val
>> > fields = line.split("\t")
>> >
>> > (fields(11), fields(6))
>> >
>> > }.distinct().countByKey()
>> >val xrdd = sc.parallelize(x.toSeq)
>> >xrdd.saveAsTextFile(...)
>> >
>> > Instead of using the glob *, I guess I can try using a for loop to read the
>> > files one by one if that helps, but not sure if there is a more efficient
>> > solution.
>> >
>> > The following is the error transcript:
>> >
>> > Job aborted due to stage failure: Task 1.0:201 failed 4 times, most recent
>> > failure: Exception failure in TID 902 on host 192.168.13.11:
>> > java.io.FileNotFoundException:
>> > /tmp/spark-local-20140829131200-0bb7/08/shuffle_0_201_999 (Too many open
>> > files)
>> > java.io.FileOutputStream.open(Native Method)
>> > java.io.FileOutputStream.(FileOutputStream.java:221)
>> > org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
>> > org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177)
>> > org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
>> > org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
>> > scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> > org.apache.spark.util.collection.AppendOnlyMap$$anon$1.foreach(AppendOnlyMap.scala:159)
>> > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>> > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>> > org.apache.spark.scheduler.Task.run(Task.scala:51)
>> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> > java.lang.Thread.run(Thread.java:744) Driver stacktrace:
>> >
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context: 
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-tp1464p13144.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >


Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Ye Xianjin
We just used CDH 4.7 for our production cluster. And I believe we won't use CDH 
5 in the next year.

Sent from my iPhone

> On 2014年8月29日, at 14:39, Matei Zaharia  wrote:
> 
> Personally I'd actually consider putting CDH4 back if there are still users 
> on it. It's always better to be inclusive, and the convenience of a one-click 
> download is high. Do we have a sense on what % of CDH users still use CDH4?
> 
> Matei
> 
> On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com) wrote:
> 
> (Copying my reply since I don't know if it goes to the mailing list) 
> 
> Great, thanks for explaining the reasoning. You're saying these aren't 
> going into the final release? I think that moots any issue surrounding 
> distributing them then. 
> 
> This is all I know of from the ASF: 
> https://community.apache.org/projectIndependence.html I don't read it 
> as expressly forbidding this kind of thing although you can see how it 
> bumps up against the spirit. There's not a bright line -- what about 
> Tomcat providing binaries compiled for Windows for example? does that 
> favor an OS vendor? 
> 
> From this technical ASF perspective only the releases matter -- do 
> what you want with snapshots and RCs. The only issue there is maybe 
> releasing something different than was in the RC; is that at all 
> confusing? Just needs a note. 
> 
> I think this theoretical issue doesn't exist if these binaries aren't 
> released, so I see no reason to not proceed. 
> 
> The rest is a different question about whether you want to spend time 
> maintaining this profile and candidate. The vendor already manages 
> their build I think and -- and I don't know -- may even prefer not to 
> have a different special build floating around. There's also the 
> theoretical argument that this turns off other vendors from adopting 
> Spark if it's perceived to be too connected to other vendors. I'd like 
> to maximize Spark's distribution and there's some argument you do this 
> by not making vendor profiles. But as I say a different question to 
> just think about over time... 
> 
> (oh and PS for my part I think it's a good thing that CDH4 binaries 
> were removed. I wasn't arguing for resurrecting them) 
> 
>> On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell  wrote: 
>> Hey Sean, 
>> 
>> The reason there are no longer CDH-specific builds is that all newer 
>> versions of CDH and HDP work with builds for the upstream Hadoop 
>> projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and 
>> the Hadoop-without-Hive (also 2.4) build. 
>> 
>> For MapR - we can't officially post those artifacts on ASF web space 
>> when we make the final release, we can only link to them as being 
>> hosted by MapR specifically since they use non-compatible licenses. 
>> However, I felt that providing these during a testing period was 
>> alright, with the goal of increasing test coverage. I couldn't find 
>> any policy against posting these on personal web space during RC 
>> voting. However, we can remove them if there is one. 
>> 
>> Dropping CDH4 was more because it is now pretty old, but we can add it 
>> back if people want. The binary packaging is a slightly separate 
>> question from release votes, so I can always add more binary packages 
>> whenever. And on this, my main concern is covering the most popular 
>> Hadoop versions to lower the bar for users to build and test Spark. 
>> 
>> - Patrick 
>> 
>>> On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen  wrote: 
>>> +1 I tested the source and Hadoop 2.4 release. Checksums and 
>>> signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't 
>>> fail any more than usual. 
>>> 
>>> FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another 
>>> project and have encountered no problems. 
>>> 
>>> 
>>> I notice that the 1.1.0 release removes the CDH4-specific build, but 
>>> adds two MapR-specific builds. Compare with 
>>> https://dist.apache.org/repos/dist/release/spark/spark-1.0.2/ I 
>>> commented on the commit: 
>>> https://github.com/apache/spark/commit/ceb19830b88486faa87ff41e18d03ede713a73cc
>>>  
>>> 
>>> I'm in favor of removing all vendor-specific builds. This change 
>>> *looks* a bit funny as there was no JIRA (?) and appears to swap one 
>>> vendor for another. Of course there's nothing untoward going on, but 
>>> what was the reasoning? It's best avoided, and MapR already 
>>> distributes Spark just fine, no? 
>>> 
>>> This is a gray area with ASF projects. I mention it as well because it 
>>> came up with Apache Flink recently 
>>> (http://mail-archives.eu.apache.org/mod_mbox/incubator-flink-dev/201408.mbox/%3CCANC1h_u%3DN0YKFu3pDaEVYz5ZcQtjQnXEjQA2ReKmoS%2Bye7%3Do%3DA%40mail.gmail.com%3E)
>>> Another vendor rightly noted this could look like favoritism. They 
>>> changed to remove vendor releases. 
>>> 
 On Fri, Aug 29, 2014 at 3:14 AM, Patrick Wendell  
 wrote: 
 Please vote on releasing the following candidate as Apache Spark version 

[jira] [Created] (SPARK-3040) pick up a more proper local ip address for Utils.findLocalIpAddress method

2014-08-14 Thread Ye Xianjin (JIRA)
Ye Xianjin created SPARK-3040:
-

 Summary: pick up a more proper local ip address for 
Utils.findLocalIpAddress method
 Key: SPARK-3040
 URL: https://issues.apache.org/jira/browse/SPARK-3040
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.2
 Environment: Mac os x, a bunch of network interfaces: eth0, wlan0, 
vnic0, vnic1, tun0, lo
Reporter: Ye Xianjin
Priority: Trivial


I noticed this inconvenience when I ran spark-shell with my virtual machines on 
and a VPN service running.

There are a lot of network interfaces on my laptop(inactive devices omitted):
{quote}
lo0: inet 127.0.0.1
en1: inet 192.168.0.102
vnic0: inet 10.211.55.2 (virtual if for vm1)
vnic1: inet 10.37.129.3 (virtual if for vm2)
tun0: inet 172.16.100.191 --> 172.16.100.191 (tun device for VPN)
{quote}

In Spark core, Utils.findLocalIpAddress() uses 
NetworkInterface.getNetworkInterfaces to get all active network interfaces, but 
unfortunately this method returns the network interfaces in reverse order compared 
to the ifconfig output (both use the ioctl sys call). I dug into the OpenJDK 6 and 
7 source code and confirmed this behavior (it only happens on Unix-like systems; 
Windows deals with it and returns them in index order). So the findLocalIpAddress 
method will pick the IP address associated with tun0 rather than en1.
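
A rough sketch of the kind of selection this asks for (illustrative only, not the actual patch): enumerate the interfaces explicitly and prefer a site-local, non-loopback IPv4 address on an interface that is up and not point-to-point, so tun/vnic devices are only picked as a last resort.
{code}
import java.net.{Inet4Address, InetAddress, NetworkInterface}
import scala.collection.JavaConverters._

// Illustrative sketch only, not the actual Spark change.
def pickLocalIpAddress(): InetAddress = {
  val candidates = NetworkInterface.getNetworkInterfaces.asScala.toSeq
    .filter(ni => ni.isUp && !ni.isLoopback && !ni.isPointToPoint)
    .flatMap(_.getInetAddresses.asScala)
    .collect { case a: Inet4Address if !a.isLoopbackAddress => a }
  candidates.find(_.isSiteLocalAddress)
    .orElse(candidates.headOption)
    .getOrElse(InetAddress.getLocalHost)
}
{code}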




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: defaultMinPartitions in textFile

2014-07-21 Thread Ye Xianjin
Well, I think you missed this line of code in SparkContext.scala, 
lines 1242-1243 (master):
 /** Default min number of partitions for Hadoop RDDs when not given by user */
  def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

So defaultMinPartitions will be 2 unless defaultParallelism is less 
than 2...
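
If you want more input partitions than that, you can pass the minimum explicitly to textFile; a quick sketch (the path and partition count are placeholders):

// Sketch: minPartitions is only a lower bound; passing it explicitly overrides
// the math.min(defaultParallelism, 2) default used by defaultMinPartitions.
val rdd = sc.textFile("hdfs:///data/input.txt", 12) // e.g. 2 executors * 6 cores
println(rdd.partitions.length) // at least 12, possibly more depending on the input splits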


--  
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, July 22, 2014 at 10:18 AM, Wang, Jensen wrote:

> Hi,  
> I started to use spark on yarn recently and found a problem while 
> tuning my program.
>   
> When SparkContext is initialized as sc and ready to read text file from hdfs, 
> the textFile(path, defaultMinPartitions) method is called.
> I traced down the second parameter in the spark source code and finally found 
> this:
>conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 
> 2))  in  CoarseGrainedSchedulerBackend.scala
>   
> I do not specify the property “spark.default.parallelism” anywhere so the 
> getInt will return value from the larger one between totalCoreCount and 2.
>   
> When I submit the application using spark-submit and specify the parameter: 
> --num-executors  2   --executor-cores 6, I suppose the totalCoreCount will be 
>  
> 2*6 = 12, so defaultMinPartitions will be 12.
>   
> But when I print the value of defaultMinPartitions in my program, I still get 
> 2 in return,  How does this happen, or where do I make a mistake?
>  
>  
>  




[jira] [Commented] (SPARK-2557) createTaskScheduler should be consistent between local and local-n-failures

2014-07-17 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065029#comment-14065029
 ] 

Ye Xianjin commented on SPARK-2557:
---

Github pr: https://github.com/apache/spark/pull/1464

> createTaskScheduler should be consistent between local and local-n-failures 
> 
>
> Key: SPARK-2557
> URL: https://issues.apache.org/jira/browse/SPARK-2557
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Ye Xianjin
>Priority: Minor
>  Labels: starter
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In SparkContext.createTaskScheduler, we can use {code}local[*]{code} to 
> estimates the number of cores on the machine. I think we should also be able 
> to use * in the local-n-failures mode.
> And according to the code in the LOCAL_N_REGEX pattern matching code, I 
> believe the regular expression of LOCAL_N_REGEX is wrong. LOCAL_N_REFEX 
> should be 
> {code}
> """local\[([0-9]+|\*)\]""".r
> {code} 
> rather than
> {code}
>  """local\[([0-9\*]+)\]""".r
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2557) createTaskScheduler should be consistent between local and local-n-failures

2014-07-17 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065001#comment-14065001
 ] 

Ye Xianjin commented on SPARK-2557:
---

I will send a pr for this.

> createTaskScheduler should be consistent between local and local-n-failures 
> 
>
> Key: SPARK-2557
> URL: https://issues.apache.org/jira/browse/SPARK-2557
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Ye Xianjin
>Priority: Minor
>  Labels: starter
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In SparkContext.createTaskScheduler, we can use {code}local[*]{code} to 
> estimates the number of cores on the machine. I think we should also be able 
> to use * in the local-n-failures mode.
> And according to the code in the LOCAL_N_REGEX pattern matching code, I 
> believe the regular expression of LOCAL_N_REGEX is wrong. LOCAL_N_REFEX 
> should be 
> {code}
> """local\[([0-9]+|\*)\]""".r
> {code} 
> rather than
> {code}
>  """local\[([0-9\*]+)\]""".r
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2557) createTaskScheduler should be consistent between local and local-n-failures

2014-07-17 Thread Ye Xianjin (JIRA)
Ye Xianjin created SPARK-2557:
-

 Summary: createTaskScheduler should be consistent between local 
and local-n-failures 
 Key: SPARK-2557
 URL: https://issues.apache.org/jira/browse/SPARK-2557
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Ye Xianjin
Priority: Minor


In SparkContext.createTaskScheduler, we can use {code}local[*]{code} to 
estimate the number of cores on the machine. I think we should also be able to 
use * in the local-n-failures mode.

And according to the LOCAL_N_REGEX pattern matching code, I believe 
the regular expression of LOCAL_N_REGEX is wrong. LOCAL_N_REGEX should be 
{code}
"""local\[([0-9]+|\*)\]""".r
{code} 
rather than
{code}
 """local\[([0-9\*]+)\]""".r
{code}
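
For illustration (not part of the original description), the difference between the two patterns can be checked quickly; the character-class version also accepts malformed strings:
{code}
// Illustration only: the proposed pattern accepts a number or a literal *,
// while the old character-class pattern also matches strings like "local[2*3]".
val proposedRegex = """local\[([0-9]+|\*)\]""".r
val oldRegex      = """local\[([0-9\*]+)\]""".r

Seq("local[4]", "local[*]", "local[2*3]", "local[**]").foreach { s =>
  println(s + " -> proposed: " + proposedRegex.findFirstIn(s).isDefined +
    ", old: " + oldRegex.findFirstIn(s).isDefined)
}
{code}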



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Where to set proxy in order to run ./install-dev.sh for SparkR

2014-07-02 Thread Ye Xianjin
You can try setting your HTTP_PROXY environment variable.

export HTTP_PROXY=host:port

But I don't use Maven. If the env variable doesn't work, please search Google 
for "maven proxy". I am sure there will be a lot of related results.

Sent from my iPhone

> On 2014年7月2日, at 19:04, Stuti Awasthi  wrote:
> 
> Hi,
>  
> I wanted to build SparkR from source but running the script behind the proxy. 
> Where shall I set proxy host and port in order to build the source. Issue is 
> not able to download dependencies from Maven
>  
> Thanks
> Stuti Awasthi
>  
> 
> 
> ::DISCLAIMER::
> 
> The contents of this e-mail and any attachment(s) are confidential and 
> intended for the named recipient(s) only.
> E-mail transmission is not guaranteed to be secure or error-free as 
> information could be intercepted, corrupted, 
> lost, destroyed, arrive late or incomplete, or may contain viruses in 
> transmission. The e mail and its contents 
> (with or without referred errors) shall therefore not attach any liability on 
> the originator or HCL or its affiliates. 
> Views or opinions, if any, presented in this email are solely those of the 
> author and may not necessarily reflect the 
> views or opinions of HCL or its affiliates. Any form of reproduction, 
> dissemination, copying, disclosure, modification, 
> distribution and / or publication of this message without the prior written 
> consent of authorized representative of 
> HCL is strictly prohibited. If you have received this email in error please 
> delete it and notify the sender immediately. 
> Before opening any email and/or attachments, please check them for viruses 
> and other defects.
> 


Re: Set comparison

2014-06-16 Thread Ye Xianjin
If you want strings with quotes, you have to escape them with '\'. That's exactly 
what you did in the modified version.
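
A tiny sketch of the difference:

// Sketch: Set("ID1") contains no quote characters, so it will never match values
// that literally include double quotes unless the quotes are escaped.
val withoutQuotes = Set("ID1", "ID2", "ID3")
val withQuotes    = Set("\"ID1\"", "\"ID2\"", "\"ID3\"")
println(withoutQuotes.contains("\"ID1\"")) // false
println(withQuotes.contains("\"ID1\""))    // true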

Sent from my iPhone

> On 2014年6月17日, at 5:43, SK  wrote:
> 
> In Line 1, I have expected_res as a set of strings with quotes. So I thought
> it would include the quotes during comparison.
> 
> Anyway I modified expected_res = Set("\"ID1\"", "\"ID2\"", "\"ID3\"") and
> that seems to work.
> 
> thanks.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Set-comparison-tp7696p7699.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


[jira] [Commented] (SPARK-1527) rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1

2014-04-23 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979309#comment-13979309
 ] 

Ye Xianjin commented on SPARK-1527:
---

Hi [~nirajsuthar],
This is my PR: https://github.com/apache/spark/pull/436. As [~srowen] said, if 
someone is interested in changing the toString()s, please leave a comment.

[~rxin], what do you think?

> rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, 
> rootDir1
> ---
>
> Key: SPARK-1527
> URL: https://issues.apache.org/jira/browse/SPARK-1527
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Ye Xianjin
>Assignee: Niraj Suthar
>Priority: Minor
>  Labels: starter
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In core/src/test/scala/org/apache/storage/DiskBlockManagerSuite.scala
>   val rootDir0 = Files.createTempDir()
>   rootDir0.deleteOnExit()
>   val rootDir1 = Files.createTempDir()
>   rootDir1.deleteOnExit()
>   val rootDirs = rootDir0.getName + "," + rootDir1.getName
> rootDir0 and rootDir1 are in system's temporary directory. 
> rootDir0.getName will not get the full path of the directory but the last 
> component of the directory. When passing to DiskBlockManage constructor, the 
> DiskBlockerManger creates directories in pwd not the temporary directory.
> rootDir0.toString will fix this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1527) rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1

2014-04-17 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973096#comment-13973096
 ] 

Ye Xianjin commented on SPARK-1527:
---

Yes, of course: sometimes we want an absolute path, and sometimes we want to pass 
a relative path. It depends on the logic.
But I think we should review these usages so that we can make sure 
absolute and relative paths are used appropriately.

I may have time to review it after I finish another JIRA issue. If you want to 
take it over, please do!

Anyway, thanks for your comments and help.


> rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, 
> rootDir1
> ---
>
> Key: SPARK-1527
> URL: https://issues.apache.org/jira/browse/SPARK-1527
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Ye Xianjin
>Priority: Minor
>  Labels: starter
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In core/src/test/scala/org/apache/storage/DiskBlockManagerSuite.scala
>   val rootDir0 = Files.createTempDir()
>   rootDir0.deleteOnExit()
>   val rootDir1 = Files.createTempDir()
>   rootDir1.deleteOnExit()
>   val rootDirs = rootDir0.getName + "," + rootDir1.getName
> rootDir0 and rootDir1 are in system's temporary directory. 
> rootDir0.getName will not get the full path of the directory but the last 
> component of the directory. When passing to DiskBlockManage constructor, the 
> DiskBlockerManger creates directories in pwd not the temporary directory.
> rootDir0.toString will fix this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1527) rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1

2014-04-17 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973087#comment-13973087
 ] 

Ye Xianjin commented on SPARK-1527:
---

Yes, you are right. toString() may give a relative path, since it's determined by 
the java.io.tmpdir system property; see 
https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/io/Files.java
 line 591. It's possible that the DiskBlockManager will create different 
directories than the original temp dir when java.io.tmpdir is a relative path.

So should we use getAbsolutePath, since I used that method in my last PR?

But I saw toString() being called in other places! Should we do something about 
that?
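
A small sketch of the three accessors on a temp dir makes the difference concrete (the output comments are illustrative):
{code}
import com.google.common.io.Files

// Sketch: why getName is wrong here and why getAbsolutePath is the safest choice.
val dir = Files.createTempDir()
dir.deleteOnExit()
println(dir.getName)         // only the last path component of the directory
println(dir.toString)        // tmpdir + name as constructed; may be relative if java.io.tmpdir is relative
println(dir.getAbsolutePath) // always the full absolute path
{code}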

> rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, 
> rootDir1
> ---
>
> Key: SPARK-1527
> URL: https://issues.apache.org/jira/browse/SPARK-1527
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>    Affects Versions: 0.9.0
>Reporter: Ye Xianjin
>Priority: Minor
>  Labels: starter
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In core/src/test/scala/org/apache/storage/DiskBlockManagerSuite.scala
>   val rootDir0 = Files.createTempDir()
>   rootDir0.deleteOnExit()
>   val rootDir1 = Files.createTempDir()
>   rootDir1.deleteOnExit()
>   val rootDirs = rootDir0.getName + "," + rootDir1.getName
> rootDir0 and rootDir1 are in system's temporary directory. 
> rootDir0.getName will not get the full path of the directory but the last 
> component of the directory. When passing to DiskBlockManage constructor, the 
> DiskBlockerManger creates directories in pwd not the temporary directory.
> rootDir0.toString will fix this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1527) rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1

2014-04-17 Thread Ye Xianjin (JIRA)
Ye Xianjin created SPARK-1527:
-

 Summary: rootDirs in DiskBlockManagerSuite doesn't get full path 
from rootDir0, rootDir1
 Key: SPARK-1527
 URL: https://issues.apache.org/jira/browse/SPARK-1527
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Ye Xianjin
Priority: Minor


In core/src/test/scala/org/apache/storage/DiskBlockManagerSuite.scala

  val rootDir0 = Files.createTempDir()
  rootDir0.deleteOnExit()
  val rootDir1 = Files.createTempDir()
  rootDir1.deleteOnExit()
  val rootDirs = rootDir0.getName + "," + rootDir1.getName

rootDir0 and rootDir1 are in the system's temporary directory. 
rootDir0.getName will not get the full path of the directory, only its last 
component. When passed to the DiskBlockManager constructor, DiskBlockManager 
creates the directories under the pwd, not the temporary directory.

rootDir0.toString will fix this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1511) Update TestUtils.createCompiledClass() API to work with creating class file on different filesystem

2014-04-17 Thread Ye Xianjin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Xianjin closed SPARK-1511.
-

   Resolution: Fixed
Fix Version/s: 1.0.0

> Update TestUtils.createCompiledClass() API to work with creating class file 
> on different filesystem
> ---
>
> Key: SPARK-1511
> URL: https://issues.apache.org/jira/browse/SPARK-1511
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.1, 0.9.0, 1.0.0
> Environment: Mac OS X, two disks. 
>Reporter: Ye Xianjin
>Priority: Minor
>  Labels: starter
> Fix For: 1.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The createCompliedClass method uses java File.renameTo method to rename 
> source file to destination file, which will fail if source and destination 
> files are on different disks (or partitions).
> see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failed-after-assembling-the-latest-code-from-github-td6315.html
>  for more details.
> Use com.google.common.io.Files.move instead of renameTo will solve this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1511) Update TestUtils.createCompiledClass() API to work with creating class file on different filesystem

2014-04-17 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972962#comment-13972962
 ] 

Ye Xianjin commented on SPARK-1511:
---

Closing this issue.
PR https://github.com/apache/spark/pull/427 solves it.

> Update TestUtils.createCompiledClass() API to work with creating class file 
> on different filesystem
> ---
>
> Key: SPARK-1511
> URL: https://issues.apache.org/jira/browse/SPARK-1511
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.1, 0.9.0, 1.0.0
> Environment: Mac OS X, two disks. 
>Reporter: Ye Xianjin
>Priority: Minor
>  Labels: starter
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The createCompliedClass method uses java File.renameTo method to rename 
> source file to destination file, which will fail if source and destination 
> files are on different disks (or partitions).
> see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failed-after-assembling-the-latest-code-from-github-td6315.html
>  for more details.
> Use com.google.common.io.Files.move instead of renameTo will solve this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1511) Update TestUtils.createCompiledClass() API to work with creating class file on different filesystem

2014-04-16 Thread Ye Xianjin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Xianjin updated SPARK-1511:
--

Affects Version/s: 0.8.1
   0.9.0

> Update TestUtils.createCompiledClass() API to work with creating class file 
> on different filesystem
> ---
>
> Key: SPARK-1511
> URL: https://issues.apache.org/jira/browse/SPARK-1511
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.1, 0.9.0, 1.0.0
> Environment: Mac OS X, two disks. 
>Reporter: Ye Xianjin
>Priority: Minor
>  Labels: starter
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The createCompliedClass method uses java File.renameTo method to rename 
> source file to destination file, which will fail if source and destination 
> files are on different disks (or partitions).
> see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failed-after-assembling-the-latest-code-from-github-td6315.html
>  for more details.
> Use com.google.common.io.Files.move instead of renameTo will solve this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1511) Update TestUtils.createCompiledClass() API to work with creating class file on different filesystem

2014-04-16 Thread Ye Xianjin (JIRA)
Ye Xianjin created SPARK-1511:
-

 Summary: Update TestUtils.createCompiledClass() API to work with 
creating class file on different filesystem
 Key: SPARK-1511
 URL: https://issues.apache.org/jira/browse/SPARK-1511
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Mac OS X, two disks. 
Reporter: Ye Xianjin
Priority: Minor


The createCompiledClass method uses the Java File.renameTo method to rename the source 
file to the destination file, which will fail if the source and destination files are 
on different disks (or partitions).

see 
http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failed-after-assembling-the-latest-code-from-github-td6315.html
 for more details.

Using com.google.common.io.Files.move instead of renameTo will solve this issue.
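
A sketch of the suggested fix (illustrative only, not the actual patch):
{code}
import java.io.File
import com.google.common.io.Files

// Illustrative sketch: Guava's Files.move falls back to copy-and-delete when a
// plain rename across filesystems is not possible, and it throws an IOException
// on failure instead of returning false like java.io.File#renameTo.
def moveCompiledClass(source: File, destDir: File): File = {
  val dest = new File(destDir, source.getName)
  Files.move(source, dest)
  dest
}
{code}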



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Tests failed after assembling the latest code from github

2014-04-14 Thread Ye Xianjin
@Sean Owen, thanks for your advice.
 There are still some failing tests on my laptop. I will work on this 
issue (the file move) as soon as I figure out the other test-related issues.


-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, April 15, 2014 at 2:41 PM, Sean Owen wrote:

> Good call -- indeed that same Files class has a move() method that
> will try to use renameTo() and then fall back to copy() and delete()
> if needed for this very reason.
> 
> 
> On Tue, Apr 15, 2014 at 6:34 AM, Ye Xianjin  (mailto:advance...@gmail.com)> wrote:
> > Hi, I think I have found the cause of the tests failing.
> > 
> > I have two disks on my laptop. The spark project dir is on an HDD disk 
> > while the tempdir created by google.io.Files.createTempDir is the 
> > /var/folders/5q/ ,which is on the system disk, an SSD.
> > The ExecutorLoaderSuite test uses 
> > org.apache.spark.TestUtils.createdCompiledClass methods.
> > The createCompiledClass method first generates the compiled class in the 
> > pwd(spark/repl), thens use renameTo to move
> > the file. The renameTo method fails because the dest file is in a different 
> > filesystem than the source file.
> > 
> > I modify the TestUtils.scala to first copy the file to dest then delete the 
> > original file. The tests go smoothly.
> > Should I issue an jira about this problem? Then I can send a pr on Github.
> > 
> 
> 
> 




Re: Tests failed after assembling the latest code from github

2014-04-14 Thread Ye Xianjin
Hi, I think I have found the cause of the failing tests. 

I have two disks on my laptop. The Spark project dir is on an HDD, while 
the temp dir created by com.google.common.io.Files.createTempDir is under 
/var/folders/5q/, which is on the system disk, an SSD.
The ExecutorClassLoaderSuite test uses the 
org.apache.spark.TestUtils.createCompiledClass methods.
The createCompiledClass method first generates the compiled class in the 
pwd (spark/repl), then uses renameTo to move
the file. The renameTo method fails because the dest file is on a different 
filesystem than the source file.

I modified TestUtils.scala to first copy the file to the dest and then delete the 
original file. The tests go smoothly.
Should I file a JIRA about this problem? Then I can send a PR on GitHub.

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, April 15, 2014 at 3:43 AM, Ye Xianjin wrote:

> well. This is very strange. 
> I looked into ExecutorClassLoaderSuite.scala and ReplSuite.scala and made 
> small changes to ExecutorClassLoaderSuite.scala (mostly output some internal 
> variables). After that, when running repl test, I noticed the ReplSuite  
> was tested first and the test result is ok. But the ExecutorClassLoaderSuite 
> test was weird.
> Here is the output:
> [info] ExecutorClassLoaderSuite:
> [error] Uncaught exception when running 
> org.apache.spark.repl.ExecutorClassLoaderSuite: java.lang.OutOfMemoryError: 
> PermGen space
> [error] Uncaught exception when running 
> org.apache.spark.repl.ExecutorClassLoaderSuite: java.lang.OutOfMemoryError: 
> PermGen space
> Internal error when running tests: java.lang.OutOfMemoryError: PermGen space
> Exception in thread "Thread-3" java.io.EOFException
> at 
> java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2577)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1685)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1323)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
> at sbt.React.react(ForkTests.scala:116)
> at sbt.ForkTests$$anonfun$mainTestTask$1$Acceptor$2$.run(ForkTests.scala:75)
> at java.lang.Thread.run(Thread.java:695)
> 
> 
> I reverted my changes. The test result is the same.
> 
> I then touched the ReplSuite.scala file (using the touch command); the test order was 
> reversed, the same as at the very beginning. And the output is also the same (the 
> result in my first post).
> 
> 
> -- 
> Ye Xianjin
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> 
> 
> On Tuesday, April 15, 2014 at 3:14 AM, Aaron Davidson wrote:
> 
> > This may have something to do with running the tests on a Mac, as there is
> > a lot of File/URI/URL stuff going on in that test which may just have
> > happened to work if run on a Linux system (like Jenkins). Note that this
> > suite was added relatively recently:
> > https://github.com/apache/spark/pull/217
> > 
> > 
> > On Mon, Apr 14, 2014 at 12:04 PM, Ye Xianjin <advance...@gmail.com> wrote:
> > 
> > > Thank you for your reply.
> > > 
> > > After building the assembly jar, the repl test still failed. The error
> > > output is the same as I posted before.
> > > 
> > > --
> > > Ye Xianjin
> > > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> > > 
> > > 
> > > On Tuesday, April 15, 2014 at 1:39 AM, Michael Armbrust wrote:
> > > 
> > > > I believe you may need an assembly jar to run the ReplSuite. "sbt/sbt
> > > > assembly/assembly".
> > > > 
> > > > Michael
> > > > 
> > > > 
> > > > On Mon, Apr 14, 2014 at 3:14 AM, Ye Xianjin <advance...@gmail.com> wrote:
> > > > 
> > > > > Hi, everyone:
> > > > > I am new to Spark development. I downloaded Spark's latest code from GitHub.
> > > > > After running sbt/sbt assembly,
> > > > > I began running sbt/sbt test in the Spark source code dir. But it failed
> > > > > when running the repl module tests.
> > > > > 
> > > > > Here are some output details.
> > > > > 
> > > > > command:
> > > > > sbt/sbt "test-only org.apache.spark.repl.*"
> > > > > output:
> > > > > 
> > > 

Re: Tests failed after assembling the latest code from github

2014-04-14 Thread Ye Xianjin
Well, this is very strange. 
I looked into ExecutorClassLoaderSuite.scala and ReplSuite.scala and made small 
changes to ExecutorClassLoaderSuite.scala (mostly printing some internal 
variables). After that, when running the repl tests, I noticed that ReplSuite 
was tested first and its result was OK. But the ExecutorClassLoaderSuite 
test was weird.
Here is the output:
[info] ExecutorClassLoaderSuite:
[error] Uncaught exception when running 
org.apache.spark.repl.ExecutorClassLoaderSuite: java.lang.OutOfMemoryError: 
PermGen space
[error] Uncaught exception when running 
org.apache.spark.repl.ExecutorClassLoaderSuite: java.lang.OutOfMemoryError: 
PermGen space
Internal error when running tests: java.lang.OutOfMemoryError: PermGen space
Exception in thread "Thread-3" java.io.EOFException
at 
java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2577)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1685)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1323)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
at sbt.React.react(ForkTests.scala:116)
at sbt.ForkTests$$anonfun$mainTestTask$1$Acceptor$2$.run(ForkTests.scala:75)
at java.lang.Thread.run(Thread.java:695)


I reverted my changes. The test result is the same.

I then touched the ReplSuite.scala file (using the touch command); the test order was 
reversed, the same as at the very beginning. And the output is also the same (the 
result in my first post).


-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, April 15, 2014 at 3:14 AM, Aaron Davidson wrote:

> This may have something to do with running the tests on a Mac, as there is
> a lot of File/URI/URL stuff going on in that test which may just have
> happened to work if run on a Linux system (like Jenkins). Note that this
> suite was added relatively recently:
> https://github.com/apache/spark/pull/217
> 
> 
> On Mon, Apr 14, 2014 at 12:04 PM, Ye Xianjin <advance...@gmail.com> wrote:
> 
> > Thank you for your reply.
> > 
> > After building the assembly jar, the repl test still failed. The error
> > output is the same as I posted before.
> > 
> > --
> > Ye Xianjin
> > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> > 
> > 
> > On Tuesday, April 15, 2014 at 1:39 AM, Michael Armbrust wrote:
> > 
> > > I believe you may need an assembly jar to run the ReplSuite. "sbt/sbt
> > > assembly/assembly".
> > > 
> > > Michael
> > > 
> > > 
> > > On Mon, Apr 14, 2014 at 3:14 AM, Ye Xianjin <advance...@gmail.com> wrote:
> > > 
> > > > Hi, everyone:
> > > > I am new to Spark development. I downloaded Spark's latest code from GitHub.
> > > > After running sbt/sbt assembly,
> > > > I began running sbt/sbt test in the Spark source code dir. But it failed
> > > > when running the repl module tests.
> > > > 
> > > > Here are some output details.
> > > > 
> > > > command:
> > > > sbt/sbt "test-only org.apache.spark.repl.*"
> > > > output:
> > > > 
> > > > [info] Loading project definition from
> > > > /Volumes/MacintoshHD/github/spark/project/project
> > > > [info] Loading project definition from
> > > > /Volumes/MacintoshHD/github/spark/project
> > > > [info] Set current project to root (in build
> > > > file:/Volumes/MacintoshHD/github/spark/)
> > > > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > > > [info] No tests to run for graphx/test:testOnly
> > > > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > > > [info] No tests to run for bagel/test:testOnly
> > > > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > > > [info] No tests to run for streaming/test:testOnly
> > > > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > > > [info] No tests to run for mllib/test:testOnly
> > > > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > > > [info] No tests to run for catalyst/test:testOnly
> > > > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > > > [info] No tests to run for core/test:testOnly
> > > > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > > > [info] No tests to run for assembly/test:testOnly
> > > 

Re: Tests failed after assembling the latest code from github

2014-04-14 Thread Ye Xianjin
Thank you for your reply. 

After building the assembly jar, the repl test still failed. The error output 
is the same as I posted before. 

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, April 15, 2014 at 1:39 AM, Michael Armbrust wrote:

> I believe you may need an assembly jar to run the ReplSuite. "sbt/sbt
> assembly/assembly".
> 
> Michael
> 
> 
> On Mon, Apr 14, 2014 at 3:14 AM, Ye Xianjin <advance...@gmail.com> wrote:
> 
> > Hi, everyone:
> > I am new to Spark development. I downloaded Spark's latest code from GitHub.
> > After running sbt/sbt assembly,
> > I began running sbt/sbt test in the Spark source code dir. But it failed
> > when running the repl module tests.
> > 
> > Here are some output details.
> > 
> > command:
> > sbt/sbt "test-only org.apache.spark.repl.*"
> > output:
> > 
> > [info] Loading project definition from
> > /Volumes/MacintoshHD/github/spark/project/project
> > [info] Loading project definition from
> > /Volumes/MacintoshHD/github/spark/project
> > [info] Set current project to root (in build
> > file:/Volumes/MacintoshHD/github/spark/)
> > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > [info] No tests to run for graphx/test:testOnly
> > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > [info] No tests to run for bagel/test:testOnly
> > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > [info] No tests to run for streaming/test:testOnly
> > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > [info] No tests to run for mllib/test:testOnly
> > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > [info] No tests to run for catalyst/test:testOnly
> > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > [info] No tests to run for core/test:testOnly
> > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > [info] No tests to run for assembly/test:testOnly
> > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > [info] No tests to run for sql/test:testOnly
> > [info] ExecutorClassLoaderSuite:
> > 2014-04-14 16:59:31.247 java[8393:1003] Unable to load realm info from
> > SCDynamicStore
> > [info] - child first *** FAILED *** (440 milliseconds)
> > [info] java.lang.ClassNotFoundException: ReplFakeClass2
> > [info] at java.lang.ClassLoader.findClass(ClassLoader.java:364)
> > [info] at
> > org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
> > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> > [info] at
> > org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
> > [info] at
> > org.apache.spark.repl.ExecutorClassLoader$$anonfun$findClass$1.apply(ExecutorClassLoader.scala:57)
> > [info] at
> > org.apache.spark.repl.ExecutorClassLoader$$anonfun$findClass$1.apply(ExecutorClassLoader.scala:57)
> > [info] at scala.Option.getOrElse(Option.scala:120)
> > [info] at
> > org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:57)
> > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> > [info] at
> > org.apache.spark.repl.ExecutorClassLoaderSuite$$anonfun$1.apply$mcV$sp(ExecutorClassLoaderSuite.scala:47)
> > [info] at
> > org.apache.spark.repl.ExecutorClassLoaderSuite$$anonfun$1.apply(ExecutorClassLoaderSuite.scala:44)
> > [info] at
> > org.apache.spark.repl.ExecutorClassLoaderSuite$$anonfun$1.apply(ExecutorClassLoaderSuite.scala:44)
> > [info] at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265)
> > [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1974)
> > [info] at
> > org.apache.spark.repl.ExecutorClassLoaderSuite.withFixture(ExecutorClassLoaderSuite.scala:30)
> > [info] at
> > org.scalatest.FunSuite$class.invokeWithFixture$1(FunSuite.scala:1262)
> > [info] at
> > org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271)
> > [info] at
> > org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271)
> > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:198)
> > [info] at org.scalatest.FunSuite$class.runTest(FunSuite.scala:1271)
> > [info] at
> > org.apache.spark.repl.ExecutorClassLoaderSuite.runTest(ExecutorClassLoaderSuite.scala:30)
> > [info] at
> > org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304)
> > [info] at
> > org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304)
> > 

Tests failed after assembling the latest code from github

2014-04-14 Thread Ye Xianjin
rAll$$super$run(ExecutorClassLoaderSuite.scala:30)
[info]   at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:213)
[info]   at 
org.apache.spark.repl.ExecutorClassLoaderSuite.run(ExecutorClassLoaderSuite.scala:30)
[info]   at 
org.scalatest.tools.ScalaTestFramework$ScalaTestRunner.run(ScalaTestFramework.scala:214)
[info]   at sbt.RunnerWrapper$1.runRunner2(FrameworkWrapper.java:220)
[info]   at sbt.RunnerWrapper$1.execute(FrameworkWrapper.java:233)
[info]   at sbt.ForkMain$Run.runTest(ForkMain.java:243)
[info]   at sbt.ForkMain$Run.runTestSafe(ForkMain.java:214)
[info]   at sbt.ForkMain$Run.runTests(ForkMain.java:190)
[info]   at sbt.ForkMain$Run.run(ForkMain.java:257)
[info]   at sbt.ForkMain.main(ForkMain.java:99)
[info] - parent first *** FAILED *** (59 milliseconds)
[info]   java.lang.ClassNotFoundException: ReplFakeClass1
...
[info]   Cause: java.lang.ClassNotFoundException: ReplFakeClass1
...
[info] - child first can fall back *** FAILED *** (39 milliseconds)
[info]   java.lang.ClassNotFoundException: ReplFakeClass3
...
[info] - child first can fail (46 milliseconds)
[info] ReplSuite:
[info] - propagation of local properties (9 seconds, 353 milliseconds)
[info] - simple foreach with accumulator (7 seconds, 608 milliseconds)
[info] - external vars (5 seconds, 783 milliseconds)
[info] - external classes (4 seconds, 341 milliseconds)
[info] - external functions (4 seconds, 106 milliseconds)
[info] - external functions that access vars (4 seconds, 538 milliseconds)
[info] - broadcast vars (4 seconds, 155 milliseconds)
[info] - interacting with files (3 seconds, 376 milliseconds)
Exception in thread "Connection manager future execution context-0"


Some output is omitted.

Here is some more information:
ReplFakeClass1.class is at {spark_source_dir}/repl/ReplFakeClass1.class, 
and the same goes for ReplFakeClass2 and 3.
ReplSuite failed while running test("local-cluster mode"). The first run of 
this test threw an OOM error; the exception shown above is from a second try.
The test("local-cluster mode") JVM options are '-Xms512M -Xmx512M', which I see 
in the corresponding stderr log.
I have a .sbtconfig file in my home dir. The content is:
export SBT_OPTS="-XX:+CMSClassUnloadingEnabled -XX:PermSize=5120M 
-XX:MaxPermSize=10240M"
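
As a side note, SBT_OPTS only configures the sbt launcher JVM; a forked test
JVM (note the sbt.ForkMain frames in the stack trace above) does not inherit
it. A hedged sketch, assuming sbt 0.13-era build settings, of how the forked
tests could be given more heap and PermGen instead:

    // build.sbt fragment (illustrative only)
    fork in Test := true
    javaOptions in Test ++= Seq("-Xmx2g", "-XX:MaxPermSize=512m")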


The tests hung after the failure in ReplSuite. I had to Ctrl-C to stop 
the test run.

Thank you for your advice.



-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)