Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Ye Xianjin
+1

Sent from my iPhone

On Apr 30, 2024, at 3:23 PM, DB Tsai wrote:

+1

On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote:

To add more color: Spark data source tables and Hive Serde tables are both stored in the Hive metastore and keep their data files in the table directory. The only difference is that they have different "table providers", which means Spark will use different readers/writers. Ideally the Spark native data source reader/writer is faster than the Hive Serde ones.

What's more, the default format of Hive Serde is text. I don't think people want to use text-format tables in production. Most people will add `STORED AS parquet` or `USING parquet` explicitly. By setting this config to false, we have a more reasonable default behavior: creating Parquet tables (or whatever is specified by `spark.sql.sources.default`).

On Tue, Apr 30, 2024 at 10:45 AM Wenchen Fan wrote:

@Mich Talebzadeh there seems to be a misunderstanding here. The Spark native data source table is still stored in the Hive metastore; it's just that Spark will use a different (and faster) reader/writer for it. `hive-site.xml` should work as it does today.

On Tue, Apr 30, 2024 at 5:23 AM Hyukjin Kwon wrote:

+1

It's a legacy conf that we should eventually remove. Spark should create Spark tables by default, not Hive tables.

Mich, for your workload, you can simply switch that conf off if it concerns you. We also enabled ANSI as well (which you agreed on). It's a bit awkward to stop in the middle for this compatibility reason while making Spark sound. The compatibility has been tested in production for a long time, so I don't see any particular issue with the compatibility case you mentioned.

On Mon, Apr 29, 2024 at 2:08 AM Mich Talebzadeh wrote:

Hi @Wenchen Fan, thanks for your response. I believe we have not had enough time to "DISCUSS" this matter. Currently, in order to make Spark take advantage of Hive, I create a soft link in $SPARK_HOME/conf. FYI, my Spark version is 3.4.0 and Hive is 3.1.1:

/opt/spark/conf/hive-site.xml -> /data6/hduser/hive-3.1.1/conf/hive-site.xml

This works fine for me in my lab. So if in the future we opt to set "spark.sql.legacy.createHiveTableByDefault" to false, will there no longer be a need for this logical link? On the face of it this looks fine, but in real life it may require a number of changes to old scripts, hence my concern. As a matter of interest, has anyone liaised with the Hive team to ensure they have introduced the additional changes you outlined?

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London
United Kingdom

   view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh

 Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, to quote: "one test result is worth one-thousand expert opinions" (Werner Von Braun).

On Sun, 28 Apr 2024 at 09:34, Wenchen Fan wrote:

@Mich Talebzadeh thanks for sharing your concern!

Note: creating Spark native data source tables is usually Hive compatible as well, unless we use features that Hive does not support (TIMESTAMP NTZ, ANSI INTERVAL, etc.). I think it's a better default to create a Spark native table in this case, instead of creating a Hive table and failing.

On Sat, Apr 27, 2024 at 12:46 PM Cheng Pan wrote:

+1 (non-binding)

Thanks,
Cheng Pan

On Sat, Apr 27, 2024 at 9:29 AM Holden Karau  wrote:
>
> +1
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh  wrote:
>>
>> +1
>>
>> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun  wrote:
>> >
>> > I'll start with my +1.
>> >
>> > Dongjoon.
>> >
>> > On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
>> > > Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault
>> > > to `false` by default. The technical scope is defined in the following PR.
>> > >
>> > > - DISCUSSION:
>> > > https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
>> > > - JIRA: https://issues.apache.org/jira/browse/SPARK-46122
>> > > - PR: https://github.com/apache/spark/pull/46207
>> > >
>> > > The vote is open until April 30th 1AM (PST) and passes
>> > > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> > >
>> > > [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by default
>> > > [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault because ...
>> > >
>> > > Thank you in advance.
>> > >
>> > > Dongjoon
>> > >
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>> 
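
As context for the vote above, here is a minimal spark-shell sketch of the behavior being changed. The table name is illustrative, and setting the legacy conf at session level is an assumption (depending on the Spark version it may need to be set at startup), so treat this as a sketch rather than a recipe:

```scala
// With the legacy flag off, CREATE TABLE without USING/STORED AS creates a
// Spark native data source table in the format given by spark.sql.sources.default
// (parquet unless overridden), instead of a Hive text-format table.
spark.sql("SET spark.sql.legacy.createHiveTableByDefault=false")
spark.sql("CREATE TABLE t_demo (id INT, name STRING)")
spark.sql("DESCRIBE TABLE EXTENDED t_demo")
  .filter("col_name = 'Provider'")
  .show(truncate = false)   // expected to show parquet rather than hive
```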

Re: Welcome two new Apache Spark committers

2023-08-06 Thread Ye Xianjin
Congratulations!

Sent from my iPhone

On Aug 7, 2023, at 11:16 AM, Yuming Wang wrote:

Congratulations!

On Mon, Aug 7, 2023 at 11:11 AM Kent Yao wrote:

Congrats! Peter and Xiduo!

Cheng Pan wrote on Mon, Aug 7, 2023 at 11:01:
>
> Congratulations! Peter and Xiduo!
>
> Thanks,
> Cheng Pan
>
>
> > On Aug 7, 2023, at 10:58, Gengliang Wang  wrote:
> >
> > Congratulations! Peter and Xiduo!
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-14 Thread Ye Xianjin
The configuration of ‘…file.upload.path’ is wrong. It means a distributed fs path to store your archives/resources/jars temporarily, which are then distributed by Spark to drivers/executors. For your case, you don't need to set this configuration.

Sent from my iPhone

On Feb 14, 2023, at 5:43 AM, karan alang wrote:

Hello All,

I'm trying to run a simple application on GKE (Kubernetes), and it is failing.
Note: I have spark (bitnami spark chart) installed on GKE using helm install.

Here is what is done:

1. Created a docker image using a Dockerfile:

```
FROM python:3.7-slim
RUN apt-get update && \
    apt-get install -y default-jre && \
    apt-get install -y openjdk-11-jre-headless && \
    apt-get clean
ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64
RUN pip install pyspark
RUN mkdir -p /myexample && chmod 755 /myexample
WORKDIR /myexample
COPY src/StructuredStream-on-gke.py /myexample/StructuredStream-on-gke.py
CMD ["pyspark"]
```

2. Simple pyspark application:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StructuredStreaming-on-gke").getOrCreate()
data = [('k1', 123000), ('k2', 234000), ('k3', 456000)]
df = spark.createDataFrame(data, ('id', 'salary'))
df.show(5, False)
```

3. Spark-submit command:

```
spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode cluster --name pyspark-example --conf spark.kubernetes.container.image=pyspark-example:0.1 --conf spark.kubernetes.file.upload.path=/myexample src/StructuredStream-on-gke.py
```

Error I get:





23/02/13 13:18:27 INFO KubernetesUtils: Uploading file: /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py to dest: /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a/StructuredStream-on-gke.py...
Exception in thread "main" org.apache.spark.SparkException: Uploading file /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py failed...
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:296)
	at org.apache.spark.deploy.k8s.KubernetesUtils$.renameMainAppResource(KubernetesUtils.scala:270)
	at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configureForPython(DriverCommandFeatureStep.scala:109)
	at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configurePod(DriverCommandFeatureStep.scala:44)
	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:59)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:89)
	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
	at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Error uploading file StructuredStream-on-gke.py
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:319)
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:292)
	... 21 more
Caused by: java.io.IOException: Mkdirs failed to create /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:317)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
	at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2369)
	at org.apache.hadoop.fs.FilterFileSystem.copyFromLocalFile(FilterFileSystem.java:368)
	at 
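
Tying this back to the advice at the top of the thread: `spark.kubernetes.file.upload.path` is only needed when a local app file is submitted in cluster mode, and it must point at a distributed filesystem (HDFS, GCS, S3A, ...) reachable from both the client and the driver, not a directory inside the image. A rough sketch using SparkLauncher; the bucket name is a placeholder and the other values are modeled on the post, not verified:

```scala
import org.apache.spark.launcher.SparkLauncher

// Sketch only: either point the upload path at a distributed FS the driver can
// reach, or bake the script into the image and reference it as local:// so no
// upload path is needed at all.
val handle = new SparkLauncher()
  .setMaster("k8s://https://34.74.22.140:7077")
  .setDeployMode("cluster")
  .setAppName("pyspark-example")
  .setConf("spark.kubernetes.container.image", "pyspark-example:0.1")
  .setConf("spark.kubernetes.file.upload.path", "gs://my-bucket/spark-uploads") // distributed FS, not /myexample
  .setAppResource("src/StructuredStream-on-gke.py")
  .startApplication()
```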

Re: [VOTE] Accept Uniffle into the Apache Incubator

2022-05-30 Thread Ye Xianjin
+1 (non-binding)

Sent from my iPhone

> On May 31, 2022, at 10:46 AM, Aloys Zhang  wrote:
> 
> +1 (non-binding)
> 
> XiaoYu wrote on Tue, May 31, 2022 at 10:12:
> 
>> +1 (non-binding)
>> 
>> Xun Liu wrote on Tue, May 31, 2022 at 10:07:
>>> 
>>> +1 (binding) for me.
>>> 
>>> Good luck!
>>> 
>>> On Tue, May 31, 2022 at 10:04 AM Goson zhang 
>> wrote:
>>> 
 +1 (non-binding)
 
 Good luck!!
 
 tison wrote on Tue, May 31, 2022 at 09:43:
 
> +1 (binding)
> 
> Best,
> tison.
> 
> 
> Jerry Shao wrote on Tue, May 31, 2022 at 09:37:
> 
>> Hi all,
>> 
>> Following up the [DISCUSS] thread on Uniffle[1] and Firestorm[2], I
 would
>> like to
>> call a VOTE to accept Uniffle into the Apache Incubator, please
>> check
 out
>> the Uniffle Proposal from the incubator wiki[3].
>> 
>> Please cast your vote:
>> 
>> [ ] +1, bring Uniffle into the Incubator
>> [ ] +0, I don't care either way
>> [ ] -1, do not bring Uniffle into the Incubator, because...
>> 
>> The vote will open at least for 72 hours, and only votes from the
>> Incubator PMC are binding, but votes from everyone are welcome.
>> 
>> [1]
>> https://lists.apache.org/thread/fyyhkjvhzl4hpzr52hd64csh5lt2wm6h
>> [2]
>> https://lists.apache.org/thread/y07xjkqzvpchncym9zr1hgm3c4l4ql0f
>> [3]
> 
>> https://cwiki.apache.org/confluence/display/INCUBATOR/UniffleProposal
>> 
>> Best regards,
>> Jerry
>> 
> 
 
>> 
>> -
>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> For additional commands, e-mail: general-h...@incubator.apache.org
>> 
>> 

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



Re: [DISCUSSION] Incubating Proposal of Uniffle

2022-05-24 Thread Ye Xianjin
+1 (non-binding).

Sent from my iPhone

> On May 25, 2022, at 9:59 AM, Goson zhang  wrote:
> 
> +1 (non-binding)
> 
> Good luck!
> 
> Daniel Widdis wrote on Wed, May 25, 2022 at 09:53:
> 
>> +1 (non-binding) from me!  Good luck!
>> 
>> On 5/24/22, 9:05 AM, "Jerry Shao"  wrote:
>> 
>>Hi all,
>> 
>>Due to the name issue in thread (
>>https://lists.apache.org/thread/y07xjkqzvpchncym9zr1hgm3c4l4ql0f), we
>>figured out a new project name "Uniffle" and created a new Thread.
>> Please
>>help to discuss.
>> 
>>We would like to propose Uniffle[1] as a new Apache incubator project,
>> you
>>can find the proposal here [2] for more details.
>> 
>>Uniffle is a high performance, general purpose Remote Shuffle Service
>> for
>>distributed compute engines like Apache Spark
>>, Apache
>>Hadoop MapReduce , Apache Flink
>> and so on. We are aiming to make
>> Firestorm a
>>universal shuffle service for distributed compute engines.
>> 
>>Shuffle is the key part for a distributed compute engine to exchange
>> the
>>data between distributed tasks, the performance and stability of
>> shuffle
>>will directly affect the whole job. Current “local file pull-like
>> shuffle
>>style” has several limitations:
>> 
>>   1. Current shuffle is hard to support super large workloads,
>> especially
>>   in a high load environment, the major problem is IO problem (random
>> disk IO
>>   issue, network congestion and timeout).
>>   2. Current shuffle is hard to deploy on the disaggregated compute
>>   storage environment, as disk capacity is quite limited on compute
>> nodes.
>>   3. The constraint of storing shuffle data locally makes it hard to
>> scale
>>   elastically.
>> 
>>Remote Shuffle Service is the key technology for enterprises to build
>> big
>>data platforms, to expand big data applications to disaggregated,
>>online-offline hybrid environments, and to solve above problems.
>> 
>>The implementation of Remote Shuffle Service -  “Uniffle”  - is heavily
>>adopted in Tencent, and shows its advantages in production. Other
>>enterprises also adopted or prepared to adopt Firestorm in their
>>environments.
>> 
>>Uniffle's key idea is brought from Sailfish shuffle
>>(https://www.researchgate.net/publication/262241541_Sailfish_a_framework_for_large_scale_data_processing),
>>it has several key design goals:
>> 
>>   1. High performance. Firestorm’s performance is close enough to
>> local
>>   file based shuffle style for small workloads. For large workloads,
>> it is
>>   far better than the current shuffle style.
>>   2. Fault tolerance. Firestorm provides high availability for
>> Coordinated
>>   nodes, and failover for Shuffle nodes.
>>   3. Pluggable. Firestorm is highly pluggable, which could be suited
>> to
>>   different compute engines, different backend storages, and different
>>   wire-protocols.
>> 
>>We believe that Uniffle project will provide the great value for the
>>community if it is accepted by the Apache incubator.
>> 
>>I will help this project as champion and many thanks to the 3 mentors:
>> 
>>   - Felix Cheung (felixche...@apache.org)
>>   - Junping du (junping...@apache.org)
>>   - Weiwei Yang (w...@apache.org)
>>   - Xun liu (liu...@apache.org)
>>   - Zhankun Tang (zt...@apache.org)
>> 
>> 
>>[1] https://github.com/Tencent/Firestorm
>>[2]
>> https://cwiki.apache.org/confluence/display/INCUBATOR/UniffleProposal
>> 
>>Best regards,
>>Jerry
>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> For additional commands, e-mail: general-h...@incubator.apache.org
>> 
>> 

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



Re: Random expr in join key not support

2021-10-19 Thread Ye Xianjin
> For that, you can add a table subquery and do it in the select list.

Do you mean something like this:
select * from t1 join (select floor(random()*9) + id as x from t2) m on t1.id = 
m.x ?

Yes, that works. But that raises another question: these two queries seem 
semantically equivalent, yet we treat them differently: one raises an analysis 
exception, while the other works well. 
Should we treat them equally?
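
For reference, the same workaround expressed in the DataFrame API rather than SQL; a sketch assuming tables t1 and t2 from the example, with the salt materialized in a projection before the join so the join condition itself stays deterministic:

```scala
import org.apache.spark.sql.functions.{col, floor, rand}

// Compute the nondeterministic salt in a projection first...
val t1 = spark.table("t1")
val salted = spark.table("t2").withColumn("x", floor(rand() * 9) + col("id"))

// ...then join on the resulting (now deterministic) column, mirroring the
// subquery form above that the analyzer accepts.
val joined = t1.join(salted, t1("id") === salted("x"))
```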




Sent from my iPhone

> On Oct 20, 2021, at 9:55 AM, Yingyi Bu  wrote:
> 
> 
> Per SQL spec, I think your join query can only be run as a NestedLoopJoin or 
> CartesianProduct.  See page 241 in SQL-99 
> (http://web.cecs.pdx.edu/~len/sql1999.pdf).
> In other words, it might be a correctness bug in other systems if they run 
> your query as a hash join.
> 
> > Here the purpose of adding a random in join key is to resolve the data skew 
> > problem.
> 
> For that, you can add a table subquery and do it in the select list.
> 
> Best,
> Yingyi
> 
> 
>> On Tue, Oct 19, 2021 at 12:46 AM Lantao Jin  wrote:
>> In PostgreSQL and Presto, the below query works well
>> sql> create table t1 (id int);
>> sql> create table t2 (id int);
>> sql> select * from t1 join t2 on t1.id = floor(random() * 9) + t2.id;
>> 
>> But it throws "Error in query: nondeterministic expressions are only allowed 
>> in Project, Filter, Aggregate or Window". Why Spark doesn't support random 
>> expressions in join condition?
>> Here the purpose to add a random in join key is to resolve the data skew 
>> problem.
>> 
>> Thanks,
>> Lantao


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-15 Thread Ye Xianjin
Hi,

Thanks for Ryan and Wenchen for leading this.

I'd like to add my two cents here. In production environments, the function 
catalog might be used by multiple systems, such as Spark, Presto and Hive. Is 
it possible that this function catalog is designed with a unified function 
catalog in mind, or at least that it wouldn't be too difficult to extend this 
catalog into a unified one?

P.S. We registered a lot of UDFs in the Hive metastore (HMS) in our production 
environment, and those UDFs are shared by Spark and Presto. It works well, even 
though it has a lot of drawbacks.
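
To make this a bit more concrete, here is a rough sketch of what a ScalarFunction-style UDF from the SPIP could look like. The trait shapes below follow the proposal as discussed in this thread and are written out locally for illustration; they are not the final API:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.{DataType, DoubleType}

// Illustrative stand-ins for the interfaces discussed in the SPIP.
trait BoundFunction extends Serializable {
  def inputTypes(): Array[DataType]
  def resultType(): DataType
  def name(): String
}

trait ScalarFunction[R] extends BoundFunction {
  // Row-at-a-time entry point; the thread's `invoke` idea is an optional
  // faster path layered on top of this.
  def produceResult(input: InternalRow): R
}

// A UDF that could live in a shared function catalog and, in principle, also
// be resolved by other engines reading the same metastore.
class Multiply extends ScalarFunction[Double] {
  override def inputTypes(): Array[DataType] = Array(DoubleType, DoubleType)
  override def resultType(): DataType = DoubleType
  override def name(): String = "multiply"
  override def produceResult(input: InternalRow): Double =
    input.getDouble(0) * input.getDouble(1)
}
```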

Sent from my iPhone

> On Feb 16, 2021, at 2:44 AM, Ryan Blue  wrote:
> 
> 
> Thanks for the positive feedback, everyone. It sounds like there is a clear 
> path forward for calling functions. Even without a prototype, the `invoke` 
> plans show that Wenchen's suggested optimization can be done, and 
> incorporating it as an optional extension to this proposal solves many of the 
> unknowns.
> 
> With that area now understood, is there any discussion about other parts of 
> the proposal, besides the function call interface?
> 
>> On Fri, Feb 12, 2021 at 10:40 PM Chao Sun  wrote:
>> This is an important feature which can unblock several other projects 
>> including bucket join support for DataSource v2, complete support for 
>> enforcing DataSource v2 distribution requirements on the write path, etc. I 
>> like Ryan's proposals which look simple and elegant, with nice support on 
>> function overloading and variadic arguments. On the other hand, I think 
>> Wenchen made a very good point about performance. Overall, I'm excited to 
>> see active discussions on this topic and believe the community will come to 
>> a proposal with the best of both sides.
>> 
>> Chao
>> 
>>> On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon  wrote:
>>> +1 for Liang-chi's.
>>> 
>>> Thanks Ryan and Wenchen for leading this.
>>> 
>>> 
>>> On Sat, Feb 13, 2021 at 12:18 PM, Liang-Chi Hsieh wrote:
 Basically I think the proposal makes sense to me and I'd like to support 
 the
 SPIP as it looks like we have strong need for the important feature.
 
 Thanks Ryan for working on this and I do also look forward to Wenchen's
 implementation. Thanks for the discussion too.
 
 Actually I think the SupportsInvoke proposed by Ryan looks a good
 alternative to me. Besides Wenchen's alternative implementation, is there a
 chance we also have the SupportsInvoke for comparison?
 
 
 John Zhuge wrote
 > Excited to see our Spark community rallying behind this important 
 > feature!
 > 
 > The proposal lays a solid foundation of minimal feature set with careful
 > considerations for future optimizations and extensions. Can't wait to see
 > it leading to more advanced functionalities like views with shared custom
 > functions, function pushdown, lambda, etc. It has already borne fruit 
 > from
 > the constructive collaborations in this thread. Looking forward to
 > Wenchen's prototype and further discussions including the SupportsInvoke
 > extension proposed by Ryan.
 > 
 > 
 > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley owen.omalley@ wrote:
 > 
 >> I think this proposal is a very good thing giving Spark a standard way 
 >> of
 >> getting to and calling UDFs.
 >>
 >> I like having the ScalarFunction as the API to call the UDFs. It is
 >> simple, yet covers all of the polymorphic type cases well. I think it
 >> would
 >> also simplify using the functions in other contexts like pushing down
 >> filters into the ORC & Parquet readers although there are a lot of
 >> details
 >> that would need to be considered there.
 >>
 >> .. Owen
 >>
 >>
 >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen ekrogen@.com wrote:
 >>
 >>> I agree that there is a strong need for a FunctionCatalog within Spark
 >>> to
 >>> provide support for shareable UDFs, as well as make movement towards
 >>> more
 >>> advanced functionality like views which themselves depend on UDFs, so I
 >>> support this SPIP wholeheartedly.
 >>>
 >>> I find both of the proposed UDF APIs to be sufficiently user-friendly
 >>> and
 >>> extensible. I generally think Wenchen's proposal is easier for a user 
 >>> to
 >>> work with in the common case, but has greater potential for confusing
 >>> and
 >>> hard-to-debug behavior due to use of reflective method signature
 >>> searches.
 >>> The merits on both sides can hopefully be more properly examined with
 >>> code,
 >>> so I look forward to seeing an implementation of Wenchen's ideas to
 >>> provide
 >>> a more concrete comparison. I am optimistic that we will not let the
 >>> debate
 >>> over this point unreasonably stall the SPIP from making progress.
 >>>
 >>> Thank 

Re: Welcoming some new committers and PMC members

2019-09-09 Thread Ye Xianjin
Congratulations!

Sent from my iPhone

> On Sep 10, 2019, at 9:19 AM, Jeff Zhang  wrote:
> 
> Congratulations!
> 
> Saisai Shao wrote on Tue, Sep 10, 2019 at 9:16 AM:
>> Congratulations!
>> 
>> Jungtaek Lim wrote on Mon, Sep 9, 2019 at 6:11 PM:
>>> Congratulations! Well deserved!
>>> 
 On Tue, Sep 10, 2019 at 9:51 AM John Zhuge  wrote:
 Congratulations!
 
> On Mon, Sep 9, 2019 at 5:45 PM Shane Knapp  wrote:
> congrats everyone!  :)
> 
> On Mon, Sep 9, 2019 at 5:32 PM Matei Zaharia  
> wrote:
> >
> > Hi all,
> >
> > The Spark PMC recently voted to add several new committers and one PMC 
> > member. Join me in welcoming them to their new roles!
> >
> > New PMC member: Dongjoon Hyun
> >
> > New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming 
> > Wang, Weichen Xu, Ruifeng Zheng
> >
> > The new committers cover lots of important areas including ML, SQL, and 
> > data sources, so it’s great to have them here. All the best,
> >
> > Matei and the Spark PMC
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> 
> 
> -- 
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
 
 
 -- 
 John Zhuge
>>> 
>>> 
>>> -- 
>>> Name : Jungtaek Lim
>>> Blog : http://medium.com/@heartsavior
>>> Twitter : http://twitter.com/heartsavior
>>> LinkedIn : http://www.linkedin.com/in/heartsavior
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang


Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread Ye Xianjin
Hi AlexG:

Files (blocks, more specifically) have 3 copies on HDFS by default, so 3.8 * 3 = 
11.4 TB.
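
If you want to see both numbers directly, here is a small sketch using the Hadoop FileSystem API; the path is a placeholder:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// ContentSummary reports both the logical file size and the raw space consumed
// by all replicas; with the default replication factor of 3 the second number
// is roughly 3x the first, matching 3.8 TB vs ~11.4 TB above.
val fs = FileSystem.get(new Configuration())
val summary = fs.getContentSummary(new Path("/path/to/dataset"))
println(f"logical size:   ${summary.getLength / 1e12}%.2f TB")
println(f"space consumed: ${summary.getSpaceConsumed / 1e12}%.2f TB")
```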

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote:

> I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2 cluster
> with 16.73 Tb storage, using
> distcp. The dataset is a collection of tar files of about 1.7 Tb each.
> Nothing else was stored in the HDFS, but after completing the download, the
> namenode page says that 11.59 Tb are in use. When I use hdfs du -h -s, I see
> that the dataset only takes up 3.8 Tb as expected. I navigated through the
> entire HDFS hierarchy from /, and don't see where the missing space is. Any
> ideas what is going on and how to rectify it?
> 
> I'm using the spark-ec2 script to launch, with the command
> 
> spark-ec2 -k key -i ~/.ssh/key.pem -s 29 --instance-type=r3.8xlarge
> --placement-group=pcavariants --copy-aws-credentials
> --hadoop-major-version=yarn --spot-price=2.8 --region=us-west-2 launch
> conversioncluster
> 
> and am not modifying any configuration files for Hadoop.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-a-3-8-T-dataset-take-up-11-59-Tb-on-HDFS-tp25471.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> (mailto:user-unsubscr...@spark.apache.org)
> For additional commands, e-mail: user-h...@spark.apache.org 
> (mailto:user-h...@spark.apache.org)
> 
> 




Re: An interesting and serious problem I encountered

2015-02-13 Thread Ye Xianjin
Hi, 

I believe SizeOf.jar may calculate the wrong size for you.
Spark has a utility called SizeEstimator, located in 
org.apache.spark.util.SizeEstimator, and someone extracted it out in 
https://github.com/phatak-dev/java-sizeof/blob/master/src/main/scala/com/madhukaraphatak/sizeof/SizeEstimator.scala
You can try that out in the Scala REPL.
The size for Array[Int](43) is 192 bytes (12 bytes object size + 4 bytes length 
field + 43 * 4 bytes of data, rounded to 176 bytes).
And the size for (1, Array[Int](43)) is 240 bytes {
   Tuple2 object: 12 bytes object size + 4 bytes field _1 + 4 bytes field _2, 
rounded to 24 bytes
   1 => java.lang.Number: 12 bytes, rounded to 16 bytes -> java.lang.Integer: 
16 bytes + 4 bytes int, rounded to 24 bytes (Integer extends Number. I thought 
Scala's Tuple2 would specialize Int and this would be 4 bytes, but it seems not.)
   Array = 192 bytes
}

So, 24 + 24 + 192 = 240 bytes.
This is my calculation based on the Spark SizeEstimator.

However, I am not sure what an Integer will occupy on a 64-bit JVM with 
compressed oops on. It should be 12 + 4 = 16 bytes, and then that would mean the 
SizeEstimator gives the wrong result. @Sean what do you think?
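
If you want to reproduce these numbers yourself, a quick sketch from a spark-shell (exact values can vary with JVM flags such as compressed oops):

```scala
import org.apache.spark.util.SizeEstimator

// SizeEstimator walks the object graph the same way Spark does when it accounts
// for cached data, so it is a better reference point than SizeOf.jar here.
SizeEstimator.estimate(new Array[Int](43))        // ~192 bytes
SizeEstimator.estimate((1, new Array[Int](43)))   // ~240 bytes
```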
-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Friday, February 13, 2015 at 2:26 PM, Landmark wrote:

 Hi folks,
 
 My Spark cluster has 8 machines, each of which has 377GB physical memory,
 and thus the total maximum memory can be used for Spark is more than
 2400+GB. In my program, I have to deal with 1 billion of (key, value) pairs,
 where the key is an integer and the value is an integer array with 43
 elements. Therefore, the memory cost of this raw dataset is [(1+43) *
 10^9 * 4] / (1024 * 1024 * 1024) = 164GB.
 
 Since I have to use this dataset repeatedly, I have to cache it in memory.
 Some key parameter settings are: 
 spark.storage.fraction=0.6
 spark.driver.memory=30GB
 spark.executor.memory=310GB.
 
 But it failed on running a simple countByKey() and the error message is
 "java.lang.OutOfMemoryError: Java heap space". Does this mean a Spark
 cluster of 2400+GB memory cannot keep 164GB of raw data in memory?
 
 The codes of my program is as follows:
 
 def main(args: Array[String]):Unit = {
 val sc = new SparkContext(new SparkConfig());
 
 val rdd = sc.parallelize(0 until 1000000000, 25600).map(i => (i, new
 Array[Int](43))).cache();
 println("The number of keys is " + rdd.countByKey());
 
 //some other operations following here ...
 }
 
 
 
 
 To figure out the issue, I evaluated the memory cost of key-value pairs and
 computed their memory cost using SizeOf.jar. The codes are as follows:
 
 val arr = new Array[Int](43);
 println(SizeOf.humanReadable(SizeOf.deepSizeOf(arr)));
 
 val tuple = (1, arr.clone);
 println(SizeOf.humanReadable(SizeOf.deepSizeOf(tuple)));
 
 The output is:
 192.0b
 992.0b
 
 
 *Hard to believe, but it is true!! This result means, to store a key-value
 pair, Tuple2 needs more than 5+ times memory than the simplest method with
 array. Even though it may take 5+ times memory, its size is less than
 1000GB, which is still much less than the total memory size of my cluster,
 i.e., 2400+GB. I really do not understand why this happened.*
 
 BTW, if the number of pairs is 1 million, it works well. If the arr contains
 only 1 integer, to store a pair, Tuples needs around 10 times memory.
 
 So I have some questions:
 1. Why does Spark choose such a poor data structure, Tuple2, for key-value
 pairs? Is there any better data structure for storing (key, value) pairs
 with less memory cost ?
 2. Given a dataset with size of M, in general Spark how many times of memory
 to handle it?
 
 
 Best,
 Landmark
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/An-interesting-and-serious-problem-I-encountered-tp21637.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 
 




Re: Welcoming three new committers

2015-02-03 Thread Ye Xianjin
Congratulations!

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, February 4, 2015 at 6:34 AM, Matei Zaharia wrote:

 Hi all,
 
 The PMC recently voted to add three new committers: Cheng Lian, Joseph 
 Bradley and Sean Owen. All three have been major contributors to Spark in the 
 past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and many 
 pieces throughout Spark Core. Join me in welcoming them as committers!
 
 Matei
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
 (mailto:dev-unsubscr...@spark.apache.org)
 For additional commands, e-mail: dev-h...@spark.apache.org 
 (mailto:dev-h...@spark.apache.org)
 
 




[jira] [Commented] (SPARK-4631) Add real unit test for MQTT

2015-01-29 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296811#comment-14296811
 ] 

Ye Xianjin commented on SPARK-4631:
---

[~dragos], Thread.sleep(50) does pass the test on my machine. 

 Add real unit test for MQTT 
 

 Key: SPARK-4631
 URL: https://issues.apache.org/jira/browse/SPARK-4631
 Project: Spark
  Issue Type: Test
  Components: Streaming
Reporter: Tathagata Das
Priority: Critical
 Fix For: 1.3.0


 A real unit test that actually transfers data to ensure that the MQTTUtil is 
 functional



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-4631) Add real unit test for MQTT

2015-01-29 Thread Ye Xianjin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Xianjin updated SPARK-4631:
--
Comment: was deleted

(was: [~dragos], Thread.sleep(50) do pass the test on my machine. )

 Add real unit test for MQTT 
 

 Key: SPARK-4631
 URL: https://issues.apache.org/jira/browse/SPARK-4631
 Project: Spark
  Issue Type: Test
  Components: Streaming
Reporter: Tathagata Das
Priority: Critical
 Fix For: 1.3.0


 A real unit test that actually transfers data to ensure that the MQTTUtil is 
 functional



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4631) Add real unit test for MQTT

2015-01-29 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296812#comment-14296812
 ] 

Ye Xianjin commented on SPARK-4631:
---

[~dragos], Thread.sleep(50) does pass the test on my machine. 

 Add real unit test for MQTT 
 

 Key: SPARK-4631
 URL: https://issues.apache.org/jira/browse/SPARK-4631
 Project: Spark
  Issue Type: Test
  Components: Streaming
Reporter: Tathagata Das
Priority: Critical
 Fix For: 1.3.0


 A real unit test that actually transfers data to ensure that the MQTTUtil is 
 functional



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Ye Xianjin
Sean,
the MQTTStreamSuite also fails for me on Mac OS X, though I don't have time 
to investigate that.

--  
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, January 28, 2015 at 9:17 PM, Sean Owen wrote:

 +1 (nonbinding). I verified that all the hash / signing items I
 mentioned before are resolved.
  
 The source package compiles on Ubuntu / Java 8. I ran tests and they
 passed. Well, actually I see the same failure I've been seeing locally on
 OS X and on Ubuntu for a while, but I think nobody else has seen this?
  
 MQTTStreamSuite:
 - mqtt input stream *** FAILED ***
 org.eclipse.paho.client.mqttv3.MqttException: Too many publishes in progress
 at 
 org.eclipse.paho.client.mqttv3.internal.ClientState.send(ClientState.java:423)
  
 Doesn't happen on Jenkins. If nobody else is seeing this, I suspect it
 is something perhaps related to my env that I haven't figured out yet,
 so should not be considered a blocker.
  
 On Wed, Jan 28, 2015 at 10:06 AM, Patrick Wendell pwend...@gmail.com 
 (mailto:pwend...@gmail.com) wrote:
  Please vote on releasing the following candidate as Apache Spark version 
  1.2.1!
   
  The tag to be voted on is v1.2.1-rc1 (commit b77f876):
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
   
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.2.1-rc2/
   
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
   
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1062/
   
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
   
  Changes from rc1:
  This has no code changes from RC1. Only minor changes to the release script.
   
  Please vote on releasing this package as Apache Spark 1.2.1!
   
  The vote is open until Saturday, January 31, at 10:04 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
   
  [ ] +1 Release this package as Apache Spark 1.2.1
  [ ] -1 Do not release this package because ...
   
  For a list of fixes in this release, see http://s.apache.org/Mpn.
   
  To learn more about Apache Spark, please see
  http://spark.apache.org/
   
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
  (mailto:dev-unsubscr...@spark.apache.org)
  For additional commands, e-mail: dev-h...@spark.apache.org 
  (mailto:dev-h...@spark.apache.org)
   
  
  
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
 (mailto:dev-unsubscr...@spark.apache.org)
 For additional commands, e-mail: dev-h...@spark.apache.org 
 (mailto:dev-h...@spark.apache.org)
  
  




[jira] [Commented] (SPARK-4631) Add real unit test for MQTT

2015-01-28 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296232#comment-14296232
 ] 

Ye Xianjin commented on SPARK-4631:
---

Hi [~dragos], I have the same issue here. I'd like to copy the email I sent to 
Sean here, which may help. 

{quote}
Hi Sean:

I enabled the debug flag in log4j. I believe the MQTTStreamSuite failure is 
more likely due to some weird network issue. However I cannot understand why 
this exception will be thrown.

what I saw in the unit-tests.log is below:
15/01/28 23:41:37.390 ActiveMQ Transport: tcp:///127.0.0.1:53845@23456 DEBUG 
Transport: Transport Connection to: tcp://127.0.0.1:53845 failed: 
java.net.ProtocolException: Invalid CONNECT encoding
java.net.ProtocolException: Invalid CONNECT encoding
at org.fusesource.mqtt.codec.CONNECT.decode(CONNECT.java:77)
at 
org.apache.activemq.transport.mqtt.MQTTProtocolConverter.onMQTTCommand(MQTTProtocolConverter.java:118)
at 
org.apache.activemq.transport.mqtt.MQTTTransportFilter.onCommand(MQTTTransportFilter.java:74)
at 
org.apache.activemq.transport.TransportSupport.doConsume(TransportSupport.java:83)
at 
org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:222)
at 
org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:204)
at java.lang.Thread.run(Thread.java:695)

However when I looked at the code 
http://grepcode.com/file/repo1.maven.org/maven2/org.fusesource.mqtt-client/mqtt-client/1.3/org/fusesource/mqtt/codec/CONNECT.java#76
 , I don’t quite understand why that would happen.
I am not familiar with ActiveMQ; maybe you can look at this and figure out what 
really happened.
{quote}

Based on a quick look at the paho.mqtt-client code, the possible cause for that 
failure is that org.eclipse.paho.mqtt-client may not write PROTOCOL_NAME in the 
MQTT frame. But that doesn't make sense, as the Jenkins runs pass the test 
successfully, so I am not sure.

 Add real unit test for MQTT 
 

 Key: SPARK-4631
 URL: https://issues.apache.org/jira/browse/SPARK-4631
 Project: Spark
  Issue Type: Test
  Components: Streaming
Reporter: Tathagata Das
Priority: Critical
 Fix For: 1.3.0


 A real unit test that actually transfers data to ensure that the MQTTUtil is 
 functional



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Can't run Spark java code from command line

2015-01-13 Thread Ye Xianjin
There is no binding issue here. Spark picks the right IP, 10.211.55.3, for you. 
The printed message is just an indication.
 However, I have no idea why the spark-shell hangs or stops.

Sent from my iPhone

 On Jan 14, 2015, at 5:10 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
 
 It just a binding issue with the hostnames in your /etc/hosts file. You can 
 set SPARK_LOCAL_IP and SPARK_MASTER_IP in your conf/spark-env.sh file and 
 restart your cluster. (in that case the spark://myworkstation:7077 will 
 change to the ip address that you provided eg: spark://10.211.55.3).
 
 Thanks
 Best Regards
 
 On Tue, Jan 13, 2015 at 11:15 PM, jeremy p athomewithagroove...@gmail.com 
 wrote:
 Hello all,
 
 I wrote some Java code that uses Spark, but for some reason I can't run it 
 from the command line.  I am running Spark on a single node (my 
 workstation). The program stops running after this line is executed :
 
  SparkContext sparkContext = new SparkContext("spark://myworkstation:7077", 
  "sparkbase");
 
 When that line is executed, this is printed to the screen : 
 15/01/12 15:56:19 WARN util.Utils: Your hostname, myworkstation resolves to 
 a loopback address: 127.0.1.1; using 10.211.55.3 instead (on interface eth0)
 15/01/12 15:56:19 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to 
 another address
 15/01/12 15:56:19 INFO spark.SecurityManager: Changing view acls to: 
 myusername
 15/01/12 15:56:19 INFO spark.SecurityManager: Changing modify acls to: 
 myusername
 15/01/12 15:56:19 INFO spark.SecurityManager: SecurityManager: 
 authentication disabled; ui acls disabled; users with view permissions: 
 Set(myusername); users with modify permissions: Set(myusername)
 
 After it writes this to the screen, the program stops executing without 
 reporting an exception.
 
 What's odd is that when I run this code from Eclipse, the same lines are 
 printed to the screen, but the program keeps executing.
 
 Don't know if it matters, but I'm using the maven assembly plugin, which 
 includes the dependencies in the JAR.
 
 Here are the versions I'm using :
 Cloudera : 2.5.0-cdh5.2.1
 Hadoop : 2.5.0-cdh5.2.1
 HBase : HBase 0.98.6-cdh5.2.1
 Java : 1.7.0_65
 Ubuntu : 14.04.1 LTS
 Spark : 1.2
 


[jira] [Created] (SPARK-5201) ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range

2015-01-11 Thread Ye Xianjin (JIRA)
Ye Xianjin created SPARK-5201:
-

 Summary: ParallelCollectionRDD.slice(seq, numSlices) has int 
overflow when dealing with inclusive range
 Key: SPARK-5201
 URL: https://issues.apache.org/jira/browse/SPARK-5201
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ye Xianjin
 Fix For: 1.2.1


{code}
 sc.makeRDD(1 to (Int.MaxValue)).count   // result = 0
 sc.makeRDD(1 to (Int.MaxValue - 1)).count   // result = 2147483646 = 
Int.MaxValue - 1
 sc.makeRDD(1 until (Int.MaxValue)).count// result = 2147483646 = 
Int.MaxValue - 1
{code}
More details on the discussion https://github.com/apache/spark/pull/2874
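
As an aside, the arithmetic behind the empty count is plain Int overflow; a minimal illustration (this is not the actual Spark slicing code, just the failure mode):

{code}
// Computing an exclusive end as `end + 1` in Int arithmetic wraps around when
// end == Int.MaxValue, which makes downstream slice bounds negative and the
// resulting partitions empty (hence count == 0 above). Widening to Long first
// avoids the wrap.
val end = Int.MaxValue
val wrapped = end + 1          // -2147483648, i.e. Int.MinValue
val widened = end.toLong + 1L  // 2147483648L
{code}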



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5201) ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range

2015-01-11 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273277#comment-14273277
 ] 

Ye Xianjin commented on SPARK-5201:
---

I will send a pr for this.

 ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing 
 with inclusive range
 --

 Key: SPARK-5201
 URL: https://issues.apache.org/jira/browse/SPARK-5201
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ye Xianjin
  Labels: rdd
 Fix For: 1.2.1

   Original Estimate: 2h
  Remaining Estimate: 2h

 {code}
  sc.makeRDD(1 to (Int.MaxValue)).count   // result = 0
  sc.makeRDD(1 to (Int.MaxValue - 1)).count   // result = 2147483646 = 
 Int.MaxValue - 1
  sc.makeRDD(1 until (Int.MaxValue)).count// result = 2147483646 = 
 Int.MaxValue - 1
 {code}
 More details on the discussion https://github.com/apache/spark/pull/2874



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Ye Xianjin
$$withChannelRetries$1(Locks.scala:78)
at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
at xsbt.boot.Using$.withResource(Using.scala:10)
at xsbt.boot.Using$.apply(Using.scala:9)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
commit c6e0c2ab1c29c184a9302d23ad75e4ccd8060242
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:64)
at sbt.IvySbt.withIvy(Ivy.scala:119)
at sbt.IvySbt.withIvy(Ivy.scala:116)
at sbt.IvySbt$Module.withModule(Ivy.scala:147)
at sbt.IvyActions$.updateEither(IvyActions.scala:156)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1282)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1279)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$84.apply(Defaults.scala:1309)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$84.apply(Defaults.scala:1307)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1312)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1306)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1324)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1264)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1242)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:235)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[error] (streaming-kafka/*:update) sbt.ResolveException: unresolved dependency: 
org.apache.kafka#kafka_2.11;0.8.0: not found
[error] (catalyst/*:update) sbt.ResolveException: unresolved dependency: 
org.scalamacros#quasiquotes_2.11;2.0.1: not found



-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, November 18, 2014 at 3:27 PM, Prashant Sharma wrote:

 It is safe in the sense we would help you with the fix if you run into 
 issues. I have used it, but since I worked on the patch the opinion can be 
 biased. I am using scala 2.11 for day to day development. You should checkout 
 the build instructions here : 
 https://github.com/ScrapCodes/spark-1/blob/patch-3/docs/building-spark.md 
 
 Prashant Sharma
 
 
 
 On Tue, Nov 18, 2014 at 12:19 PM, Jianshi Huang jianshi.hu...@gmail.com 
 (mailto:jianshi.hu...@gmail.com) wrote:
  Any notable issues for using Scala 2.11? Is it stable now?
  
  Or can I use Scala 2.11 in my spark application and use Spark dist build 
  with 2.10 ?
  
  I'm looking forward to migrate to 2.11 for some quasiquote features. 
  Couldn't make it run in 2.10...
  
  Cheers,
  -- 
  Jianshi Huang
  
  LinkedIn: jianshi
  Twitter: @jshuang
  Github  Blog: http://huangjs.github.com/
 



[jira] [Commented] (FLUME-2385) Flume spans log file with Spooling Directory Source runner has shutdown messages at INFO level

2014-11-10 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/FLUME-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14205891#comment-14205891
 ] 

Ye Xianjin commented on FLUME-2385:
---

Hi [~scaph01], I think (according to my colleague) the more reasonable change 
is to set the log level to debug.

 Flume spans log file with Spooling Directory Source runner has shutdown 
 messages at INFO level
 

 Key: FLUME-2385
 URL: https://issues.apache.org/jira/browse/FLUME-2385
 Project: Flume
  Issue Type: Improvement
Affects Versions: v1.4.0
Reporter: Justin Hayes
Assignee: Phil Scala
Priority: Minor
 Fix For: v1.6.0

 Attachments: FLUME-2385-0.patch


 When I start an agent with the following config, the spooling directory 
 source emits "14/05/14 22:36:12 INFO source.SpoolDirectorySource: Spooling 
 Directory Source runner has shutdown." messages twice a second. Pretty 
 innocuous but it will fill up the file system needlessly and get in the way 
 of other INFO messages.
 cis.sources = httpd
 cis.sinks = loggerSink
 cis.channels = mem2logger
 cis.sources.httpd.type = spooldir
 cis.sources.httpd.spoolDir = /var/log/httpd
 cis.sources.httpd.trackerDir = /var/lib/flume-ng/tracker/httpd
 cis.sources.httpd.channels = mem2logger
 cis.sinks.loggerSink.type = logger
 cis.sinks.loggerSink.channel = mem2logger
 cis.channels.mem2logger.type = memory
 cis.channels.mem2logger.capacity = 1
 cis.channels.mem2logger.transactionCapacity = 1000 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-4002) JavaKafkaStreamSuite.testKafkaStream fails on OSX

2014-10-22 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179753#comment-14179753
 ] 

Ye Xianjin commented on SPARK-4002:
---

Hi [~rdub], what's your Mac OS X hostname? Mine was advancedxy's-pro; 
notice the illegal ['] there in the hostname. That's what caused Kafka to fail 
when I saw a Kafka-related test failure a couple of weeks ago. Hope it's 
related. 
The detail is in the unit-tests.log. So, as [~jerryshao] said, it's better if you 
post your unit test log here, and we may get the real cause.

 JavaKafkaStreamSuite.testKafkaStream fails on OSX
 -

 Key: SPARK-4002
 URL: https://issues.apache.org/jira/browse/SPARK-4002
 Project: Spark
  Issue Type: Bug
  Components: Streaming
 Environment: Mac OSX 10.9.5.
Reporter: Ryan Williams

 [~sowen] mentioned this on spark-dev 
 [here|http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccamassdjs0fmsdc-k-4orgbhbfz2vvrmm0hfyifeeal-spft...@mail.gmail.com%3E]
  and I just reproduced it on {{master}} 
 ([7e63bb4|https://github.com/apache/spark/commit/7e63bb49c526c3f872619ae14e4b5273f4c535e9]).
 The relevant output I get when running {{./dev/run-tests}} is:
 {code}
 [info] KafkaStreamSuite:
 [info] - Kafka input stream
 [info] Test run started
 [info] Test 
 org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream started
 [error] Test 
 org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream failed: 
 junit.framework.AssertionFailedError: expected:3 but was:0
 [error] at junit.framework.Assert.fail(Assert.java:50)
 [error] at junit.framework.Assert.failNotEquals(Assert.java:287)
 [error] at junit.framework.Assert.assertEquals(Assert.java:67)
 [error] at junit.framework.Assert.assertEquals(Assert.java:199)
 [error] at junit.framework.Assert.assertEquals(Assert.java:205)
 [error] at 
 org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream(JavaKafkaStreamSuite.java:129)
 [error] ...
 [info] Test run finished: 1 failed, 0 ignored, 1 total, 19.798s
 {code}
 Seems like this test should be {{@Ignore}}'d, or some note about this made on 
 the {{README.md}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: spark_classpath in core/pom.xml and yarn/pom.xml

2014-09-25 Thread Ye Xianjin
Hi Sandy Ryza,
 I believe it was you who originally added SPARK_CLASSPATH in core/pom.xml in 
the org.scalatest section. Is this still needed in 1.1?
 I noticed this setting because when I looked into the unit-tests.log, it 
shows something like the below:
 14/09/24 23:57:19.246 WARN SparkConf:
 SPARK_CLASSPATH was detected (set to 'null').
 This is deprecated in Spark 1.0+.
 
 Please instead use:
  - ./spark-submit with --driver-class-path to augment the driver classpath
  - spark.executor.extraClassPath to augment the executor classpath
 
 14/09/24 23:57:19.246 WARN SparkConf: Setting 'spark.executor.extraClassPath' 
 to 'null' as a work-around.
 14/09/24 23:57:19.247 WARN SparkConf: Setting 'spark.driver.extraClassPath' 
 to 'null' as a work-around.

However, I didn't set the SPARK_CLASSPATH env variable. And looking into 
SparkConf.scala, if the user actually sets extraClassPath, SparkConf will throw a 
SparkException.
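
For anyone hitting the same warning: the replacements it points to are the spark-submit --driver-class-path flag or the extraClassPath properties. A sketch; the jar path is a placeholder:

```scala
import org.apache.spark.SparkConf

// Equivalent of the deprecated SPARK_CLASSPATH, set explicitly on the conf
// (or passed via --driver-class-path / --conf on spark-submit).
val conf = new SparkConf()
  .set("spark.driver.extraClassPath", "/opt/libs/extra.jar")
  .set("spark.executor.extraClassPath", "/opt/libs/extra.jar")
```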
-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, September 23, 2014 at 12:56 AM, Ye Xianjin wrote:

 Hi:
 I notice the scalatest-maven-plugin sets the SPARK_CLASSPATH environment 
 variable for testing. But in SparkConf.scala, this is deprecated in Spark 
 1.0+.
 So what is this variable for? Should we just remove this variable?
 
 
 -- 
 Ye Xianjin
 Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
 



Re: spark_classpath in core/pom.xml and yarn/pom.xml

2014-09-25 Thread Ye Xianjin
Hi Sandy, 

Sorry for the bother. 

The tests run OK even with the SPARK_CLASSPATH setting there now, but it gives a 
config warning and will potentially interfere with other settings, as Marcelo said. 
The warning goes away if I remove it.

And Marcelo, I believe the setting in core/pom.xml should not be used any more. But 
I don't think it's worth filing a JIRA for such a small change; maybe put it 
into another related JIRA. It's a pity that your PR
already got merged.
 

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Friday, September 26, 2014 at 6:29 AM, Sandy Ryza wrote:

 Hi Ye,
 
 I think git blame shows me because I fixed the formatting in core/pom.xml, 
 but I don't actually know the original reason for setting SPARK_CLASSPATH 
 there.
 
 Do the tests run OK if you take it out?
 
 -Sandy
 
 
 On Thu, Sep 25, 2014 at 1:59 AM, Ye Xianjin advance...@gmail.com 
 (mailto:advance...@gmail.com) wrote:
  Hi Sandy Ryza,
   I believe it was you who originally added SPARK_CLASSPATH in 
  core/pom.xml in the org.scalatest section. Is this still needed in 1.1?
   I noticed this setting because when I looked into the unit-tests.log, 
  it shows something like the below:
   14/09/24 23:57:19.246 WARN SparkConf:
   SPARK_CLASSPATH was detected (set to 'null').
   This is deprecated in Spark 1.0+.
  
   Please instead use:
- ./spark-submit with --driver-class-path to augment the driver classpath
- spark.executor.extraClassPath to augment the executor classpath
  
   14/09/24 23:57:19.246 WARN SparkConf: Setting 
   'spark.executor.extraClassPath' to 'null' as a work-around.
   14/09/24 23:57:19.247 WARN SparkConf: Setting 
   'spark.driver.extraClassPath' to 'null' as a work-around.
  
  However I didn't set SPARK_CLASSPATH env variable. And looked into the 
  SparkConf.scala, If user actually set extraClassPath,  the SparkConf will 
  throw SparkException.
  --
  Ye Xianjin
  Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
  
  
  On Tuesday, September 23, 2014 at 12:56 AM, Ye Xianjin wrote:
  
   Hi:
   I notice the scalatest-maven-plugin sets the SPARK_CLASSPATH environment 
   variable for testing. But in SparkConf.scala, this is deprecated in 
   Spark 1.0+.
   So what is this variable for? Should we just remove this variable?
  
  
   --
   Ye Xianjin
   Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
  
  
 



spark_classpath in core/pom.xml and yarn/pom.xml

2014-09-22 Thread Ye Xianjin
Hi:
I notice the scalatest-maven-plugin sets the SPARK_CLASSPATH environment 
variable for testing. But in SparkConf.scala, this is deprecated in Spark 
1.0+.
So what is this variable for? Should we just remove this variable?


-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)



java_home detection bug in maven 3.2.3

2014-09-18 Thread Ye Xianjin
Hi, Developers:
  I found this bug today on Mac OS X 10.10. 

  Maven version: 3.2.3
  File path: apache-maven-3.2.3/apache-maven/src/bin/mvn  line86
  Code snippet:
  
   if [[ -z $JAVA_HOME && -x /usr/libexec/java_home ]] ; then
 #
 # Apple JDKs
 #
 export JAVA_HOME=/usr/libexec/java_home
   fi
   
   It should be :

   if [[ -z $JAVA_HOME && -x /usr/libexec/java_home ]] ; then
 #
 # Apple JDKs
 #
 export JAVA_HOME=`/usr/libexec/java_home`
   fi

   I wanted to file a JIRA at http://jira.codehaus.org 
(http://jira.codehaus.org/), but it seems it's not open for registration. So I 
think maybe it's a good idea to send an email here.

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)



Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
Great. And you should ask questions on the user@spark.apache.org mailing list. I 
believe many people don't subscribe to the incubator mailing list now.

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, September 10, 2014 at 6:03 PM, redocpot wrote:

 Hi, 
 
 I am using spark 1.0.0. The bug is fixed by 1.0.1.
 
 Hao
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13864.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 
 




Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
|  Do the two mailing lists share messages ?
I don't think so. I didn't receive this message from the user list. I am not 
at Databricks, so I can't answer your other questions. Maybe Davies Liu 
dav...@databricks.com can answer you?

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, September 10, 2014 at 9:05 PM, redocpot wrote:

 Hi, Xianjin
 
 I checked user@spark.apache.org (mailto:user@spark.apache.org), and found my 
 post there:
 http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/browser
 
 I am using nabble to send this mail, which indicates that the mail will be
 sent from my email address to the u...@spark.incubator.apache.org 
 (mailto:u...@spark.incubator.apache.org) mailing
 list.
 
 Do the two mailing lists share messages ?
 
 Do we have a nabble interface for user@spark.apache.org 
 (mailto:user@spark.apache.org) mail list ?
 
 Thank you.
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13876.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 
 




Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
Well, that's weird. I don't see this thread in my mailbox as being sent to the user 
list. Maybe because I also subscribe to the incubator mailing list? I do see mails 
sent to the incubator mailing list with no one replying. I thought it was because 
people don't subscribe to the incubator list now.

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Thursday, September 11, 2014 at 12:12 AM, Davies Liu wrote:

 I think the mails to spark.incubator.apache.org 
 (http://spark.incubator.apache.org) will be forwarded to
 spark.apache.org (http://spark.apache.org).
 
 Here is the header of the first mail:
 
 from: redocpot julien19890...@gmail.com (mailto:julien19890...@gmail.com)
 to: u...@spark.incubator.apache.org (mailto:u...@spark.incubator.apache.org)
 date: Mon, Sep 8, 2014 at 7:29 AM
 subject: groupBy gives non deterministic results
 mailing list: user.spark.apache.org (http://user.spark.apache.org)
 mailed-by: spark.apache.org (http://spark.apache.org)
 
 I only subscribe to spark.apache.org (http://spark.apache.org), and I do see all 
 the mails from him.
 
 On Wed, Sep 10, 2014 at 6:29 AM, Ye Xianjin advance...@gmail.com 
 (mailto:advance...@gmail.com) wrote:
  | Do the two mailing lists share messages ?
  I don't think so. I didn't receive this message from the user list. I am
  not in databricks, so I can't answer your other questions. Maybe Davies Liu
  dav...@databricks.com (mailto:dav...@databricks.com) can answer you?
  
  --
  Ye Xianjin
  Sent with Sparrow
  
  On Wednesday, September 10, 2014 at 9:05 PM, redocpot wrote:
  
  Hi, Xianjin
  
  I checked user@spark.apache.org (mailto:user@spark.apache.org), and found 
  my post there:
  http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/browser
  
  I am using nabble to send this mail, which indicates that the mail will be
  sent from my email address to the u...@spark.incubator.apache.org 
  (mailto:u...@spark.incubator.apache.org) mailing
  list.
  
  Do the two mailing lists share messages ?
  
  Do we have a nabble interface for user@spark.apache.org 
  (mailto:user@spark.apache.org) mail list ?
  
  Thank you.
  
  
  
  
  --
  View this message in context:
  http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13876.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com 
  (http://Nabble.com).
  
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
  (mailto:user-unsubscr...@spark.apache.org)
  For additional commands, e-mail: user-h...@spark.apache.org 
  (mailto:user-h...@spark.apache.org)
  
 
 
 




Re: groupBy gives non deterministic results

2014-09-09 Thread Ye Xianjin
Can you provide a small sample or test data that reproduces this problem? And 
what's your env setup? Single node or cluster?

Sent from my iPhone

 On Sep 8, 2014, at 22:29, redocpot julien19890...@gmail.com wrote:
 
 Hi,
 
 I have a key-value RDD called rdd below. After a groupBy, I tried to count
 rows.
 But the result is not unique, somehow non deterministic.
 
 Here is the test code:
 
  val step1 = ligneReceipt_cleTable.persist
  val step2 = step1.groupByKey
 
  val s1size = step1.count
  val s2size = step2.count
 
  val t = step2 // rdd after groupBy
 
  val t1 = t.count
  val t2 = t.count
  val t3 = t.count
  val t4 = t.count
  val t5 = t.count
  val t6 = t.count
  val t7 = t.count
  val t8 = t.count
 
  println("s1size = " + s1size)
  println("s2size = " + s2size)
  println("1 = " + t1)
  println("2 = " + t2)
  println("3 = " + t3)
  println("4 = " + t4)
  println("5 = " + t5)
  println("6 = " + t6)
  println("7 = " + t7)
  println("8 = " + t8)
 
 Here are the results:
 
 s1size = 5338864
 s2size = 5268001
 1 = 5268002
 2 = 5268001
 3 = 5268001
 4 = 5268002
 5 = 5268001
 6 = 5268002
 7 = 5268002
 8 = 5268001
 
 Even if the difference is just one row, that's annoying.  
 
 Any idea ?
 
 Thank you.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Ye Xianjin
What did you see in the log? Was there anything related to mapreduce?
Can you log into your hdfs (data) node, use jps to list all java processes, and 
confirm whether there is a tasktracker process (or nodemanager) running alongside 
the datanode process?


-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Monday, September 8, 2014 at 11:13 PM, Tomer Benyamini wrote:

 Still no luck, even when running stop-all.sh followed by 
 start-all.sh.
 
 On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas
 nicholas.cham...@gmail.com (mailto:nicholas.cham...@gmail.com) wrote:
  Tomer,
  
  Did you try start-all.sh? It worked for me the last 
  time I tried using
  distcp, and it worked for this guy too.
  
  Nick
  
  
  On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com 
  (mailto:tomer@gmail.com) wrote:
   
   ~/ephemeral-hdfs/sbin/start-mapred.sh does not 
   exist on spark-1.0.2;
   
   I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and
   ~/ephemeral-hdfs/sbin/start-dfs.sh, but still 
   getting the same error
   when trying to run distcp:
   
   ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
   
   java.io.IOException: Cannot initialize Cluster. Please check your
   configuration for mapreduce.framework.name and the correspond server
   addresses.
   
   at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
   
   at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
   
   at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
   
   at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
   
   at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
   
   at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
   
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
   
   at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
   
   Any idea?
   
   Thanks!
   Tomer
   
   On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com 
   (mailto:rosenvi...@gmail.com) wrote:
If I recall, you should be able to start Hadoop MapReduce using
~/ephemeral-hdfs/sbin/start-mapred.sh.

On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com 
(mailto:tomer@gmail.com)
wrote:
 
 Hi,
 
 I would like to copy log files from s3 to the cluster's
 ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
 running on the cluster - I'm getting the exception below.
 
 Is there a way to activate it, or is there a spark alternative to
 distcp?
 
 Thanks,
 Tomer
 
 mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
 org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
 Invalid mapreduce.jobtracker.address configuration value for
 LocalJobRunner : XXX:9001
 
 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
 
 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.
 
 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
 
 at 
 org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
 
 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
 
 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
 
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 
 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 


   
   
   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
   (mailto:user-unsubscr...@spark.apache.org)
   For additional commands, e-mail: user-h...@spark.apache.org 
   (mailto:user-h...@spark.apache.org)
   
  
  
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 
 




Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Ye Xianjin
Well, this means you didn't start a compute cluster, most likely because the 
wrong value of mapreduce.jobtracker.address prevents the slave node from starting 
the node manager. (I am not familiar with the ec2 script, so I don't know 
whether the slave node has a node manager installed or not.)
Can you check the hadoop daemon log on the slave node to see whether the 
nodemanager started but failed, or whether there was no nodemanager to start? The 
log file location defaults to /var/log/hadoop-xxx if my memory is correct.

Sent from my iPhone

 On Sep 9, 2014, at 0:08, Tomer Benyamini tomer@gmail.com wrote:
 
 No tasktracker or nodemanager. This is what I see:
 
 On the master:
 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
 org.apache.hadoop.hdfs.server.namenode.NameNode
 
 On the data node (slave):
 
 org.apache.hadoop.hdfs.server.datanode.DataNode
 
 
 
 On Mon, Sep 8, 2014 at 6:39 PM, Ye Xianjin advance...@gmail.com wrote:
 what did you see in the log? was there anything related to mapreduce?
 can you log into your hdfs (data) node, use jps to list all java process and
 confirm whether there is a tasktracker process (or nodemanager) running with
 datanode process
 
 --
 Ye Xianjin
 Sent with Sparrow
 
 On Monday, September 8, 2014 at 11:13 PM, Tomer Benyamini wrote:
 
 Still no luck, even when running stop-all.sh followed by start-all.sh.
 
 On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
 
 Tomer,
 
 Did you try start-all.sh? It worked for me the last time I tried using
 distcp, and it worked for this guy too.
 
 Nick
 
 
 On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com wrote:
 
 
 ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2;
 
 I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and
 ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error
 when trying to run distcp:
 
 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
 
 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.
 
 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
 
 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
 
 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
 
 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
 
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 
 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
 
 Any idea?
 
 Thanks!
 Tomer
 
 On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote:
 
 If I recall, you should be able to start Hadoop MapReduce using
 ~/ephemeral-hdfs/sbin/start-mapred.sh.
 
 On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com
 wrote:
 
 
 Hi,
 
 I would like to copy log files from s3 to the cluster's
 ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
 running on the cluster - I'm getting the exception below.
 
 Is there a way to activate it, or is there a spark alternative to
 distcp?
 
 Thanks,
 Tomer
 
 mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
 org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
 Invalid mapreduce.jobtracker.address configuration value for
 LocalJobRunner : XXX:9001
 
 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
 
 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.
 
 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
 
 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
 
 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
 
 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
 
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 
 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail

Re: about spark assembly jar

2014-09-02 Thread Ye Xianjin
Sorry, the quick reply didn't cc the dev list.

Sean, sometimes I have to use the spark-shell to confirm some behavior change. 
In that case, I have to reassemble the whole project. Is there another way 
around this, i.e. not using the big jar in development? For the original question, I 
have no comments. 

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, September 2, 2014 at 4:58 PM, Sean Owen wrote:

 No, usually you unit-test your changes during development. That
 doesn't require the assembly. Eventually you may wish to test some
 change against the complete assembly.
 
 But that's a different question; I thought you were suggesting that
 the assembly JAR should never be created.
 
 On Tue, Sep 2, 2014 at 9:53 AM, Ye Xianjin advance...@gmail.com 
 (mailto:advance...@gmail.com) wrote:
  Hi, Sean:
  In development, do I really need to reassemble the whole project even if I
  only change a line or two of code in one component?
  I used to do that but found it time-consuming.
  
  --
  Ye Xianjin
  Sent with Sparrow
  
  On Tuesday, September 2, 2014 at 4:45 PM, Sean Owen wrote:
  
  Hm, are you suggesting that the Spark distribution be a bag of 100
  JARs? It doesn't quite seem reasonable. It does not remove version
  conflicts, just pushes them to run-time, which isn't good. The
  assembly is also necessary because that's where shading happens. In
  development, you want to run against exactly what will be used in a
  real Spark distro.
  
  On Tue, Sep 2, 2014 at 9:39 AM, scwf wangf...@huawei.com 
  (mailto:wangf...@huawei.com) wrote:
  
  hi, all
  I suggest Spark not use the assembly jar as the default run-time
  dependency (spark-submit/spark-class depend on the assembly jar); using a library
  of all 3rd-party dependency jars, as hadoop/hive/hbase do, would be more reasonable.
  
  1. The assembly jar packages all 3rd-party jars into one big jar, so we need to
  rebuild it if we want to update the version of some component (such as hadoop).
  2. In our practice with Spark, we sometimes hit jar compatibility issues, and
  it is hard to diagnose a compatibility issue with the assembly jar.
  
  
  
  
  
  
  
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
  (mailto:dev-unsubscr...@spark.apache.org)
  For additional commands, e-mail: dev-h...@spark.apache.org 
  (mailto:dev-h...@spark.apache.org)
  
  
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
  (mailto:dev-unsubscr...@spark.apache.org)
  For additional commands, e-mail: dev-h...@spark.apache.org 
  (mailto:dev-h...@spark.apache.org)
  
 
 
 




[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results

2014-09-01 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117558#comment-14117558
 ] 

Ye Xianjin commented on SPARK-3098:
---

hi, [~srowen] and [~gq], I think what [~matei] wants to say is that because the 
ordering of elements in distinct() is not guaranteed, the result of 
zipWithIndex is not deterministic. If you recompute the RDD with distinct 
transformation, you are not guaranteed to get the same result. That explains 
the behavior here.

But as [~srowen] said, it's surprising to see different results from the same 
RDD. [~matei], what do you think about this behavior?
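
As an illustration only (not part of this JIRA), the usual mitigation is to impose 
an order and materialize the RDD once before relying on the indices, e.g.:

{code}
// Sketch: sort before zipWithIndex so the element-to-index assignment is stable,
// and cache so recomputation cannot reshuffle it.
val base = sc.parallelize(1 to 1000).flatMap(i => Seq(i, i + 1)).distinct()
val indexed = base.sortBy(identity).zipWithIndex().cache()
indexed.count()   // later actions reuse the cached, fixed indices
{code}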

  In some cases, operation zipWithIndex get a wrong results
 --

 Key: SPARK-3098
 URL: https://issues.apache.org/jira/browse/SPARK-3098
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Guoqiang Li
Priority: Critical

 The reproduce code:
 {code}
  val c = sc.parallelize(1 to 7899).flatMap { i =>
   (1 to 1).toSeq.map(p => i * 6000 + p)
 }.distinct().zipWithIndex() 
 c.join(c).filter(t => t._2._1 != t._2._2).take(3)
 {code}
  = 
 {code}
  Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), 
 (36579712,(13,14)))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Ye Xianjin
We just used CDH 4.7 for our production cluster. And I believe we won't use CDH 
5 in the next year.

Sent from my iPhone

 On Aug 29, 2014, at 14:39, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 Personally I'd actually consider putting CDH4 back if there are still users 
 on it. It's always better to be inclusive, and the convenience of a one-click 
 download is high. Do we have a sense on what % of CDH users still use CDH4?
 
 Matei
 
 On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com) wrote:
 
 (Copying my reply since I don't know if it goes to the mailing list) 
 
 Great, thanks for explaining the reasoning. You're saying these aren't 
 going into the final release? I think that moots any issue surrounding 
 distributing them then. 
 
 This is all I know of from the ASF: 
 https://community.apache.org/projectIndependence.html I don't read it 
 as expressly forbidding this kind of thing although you can see how it 
 bumps up against the spirit. There's not a bright line -- what about 
 Tomcat providing binaries compiled for Windows for example? does that 
 favor an OS vendor? 
 
 From this technical ASF perspective only the releases matter -- do 
 what you want with snapshots and RCs. The only issue there is maybe 
 releasing something different than was in the RC; is that at all 
 confusing? Just needs a note. 
 
 I think this theoretical issue doesn't exist if these binaries aren't 
 released, so I see no reason to not proceed. 
 
 The rest is a different question about whether you want to spend time 
 maintaining this profile and candidate. The vendor already manages 
 their build I think and -- and I don't know -- may even prefer not to 
 have a different special build floating around. There's also the 
 theoretical argument that this turns off other vendors from adopting 
 Spark if it's perceived to be too connected to other vendors. I'd like 
 to maximize Spark's distribution and there's some argument you do this 
 by not making vendor profiles. But as I say a different question to 
 just think about over time... 
 
 (oh and PS for my part I think it's a good thing that CDH4 binaries 
 were removed. I wasn't arguing for resurrecting them) 
 
 On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell pwend...@gmail.com wrote: 
 Hey Sean, 
 
 The reason there are no longer CDH-specific builds is that all newer 
 versions of CDH and HDP work with builds for the upstream Hadoop 
 projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and 
 the Hadoop-without-Hive (also 2.4) build. 
 
 For MapR - we can't officially post those artifacts on ASF web space 
 when we make the final release, we can only link to them as being 
 hosted by MapR specifically since they use non-compatible licenses. 
 However, I felt that providing these during a testing period was 
 alright, with the goal of increasing test coverage. I couldn't find 
 any policy against posting these on personal web space during RC 
 voting. However, we can remove them if there is one. 
 
 Dropping CDH4 was more because it is now pretty old, but we can add it 
 back if people want. The binary packaging is a slightly separate 
 question from release votes, so I can always add more binary packages 
 whenever. And on this, my main concern is covering the most popular 
 Hadoop versions to lower the bar for users to build and test Spark. 
 
 - Patrick 
 
 On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com wrote: 
 +1 I tested the source and Hadoop 2.4 release. Checksums and 
 signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't 
 fail any more than usual. 
 
 FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another 
 project and have encountered no problems. 
 
 
 I notice that the 1.1.0 release removes the CDH4-specific build, but 
 adds two MapR-specific builds. Compare with 
 https://dist.apache.org/repos/dist/release/spark/spark-1.0.2/ I 
 commented on the commit: 
 https://github.com/apache/spark/commit/ceb19830b88486faa87ff41e18d03ede713a73cc
  
 
 I'm in favor of removing all vendor-specific builds. This change 
 *looks* a bit funny as there was no JIRA (?) and appears to swap one 
 vendor for another. Of course there's nothing untoward going on, but 
 what was the reasoning? It's best avoided, and MapR already 
 distributes Spark just fine, no? 
 
 This is a gray area with ASF projects. I mention it as well because it 
 came up with Apache Flink recently 
 (http://mail-archives.eu.apache.org/mod_mbox/incubator-flink-dev/201408.mbox/%3CCANC1h_u%3DN0YKFu3pDaEVYz5ZcQtjQnXEjQA2ReKmoS%2Bye7%3Do%3DA%40mail.gmail.com%3E)
 Another vendor rightly noted this could look like favoritism. They 
 changed to remove vendor releases. 
 
 On Fri, Aug 29, 2014 at 3:14 AM, Patrick Wendell pwend...@gmail.com 
 wrote: 
 Please vote on releasing the following candidate as Apache Spark version 
 1.1.0! 
 
 The tag to be voted on is v1.1.0-rc2 (commit 711aebb3): 
 

Re: Too many open files

2014-08-29 Thread Ye Xianjin
Oops, the last reply didn't go to the user list. Mail app's fault.

Shuffling happens across the cluster, so you need to change all the nodes in the 
cluster.



Sent from my iPhone

 On Aug 30, 2014, at 3:10, Sudha Krishna skrishna...@gmail.com wrote:
 
 Hi,
 
 Thanks for your response. Do you know if I need to change this limit on all 
 the cluster nodes or just the master?
 Thanks
 
 On Aug 29, 2014 11:43 AM, Ye Xianjin advance...@gmail.com wrote:
 A file limit of 1024 is most likely too small for Linux 
 machines in production. Try to set it to 65536 or unlimited if you can. The too 
 many open files error occurs because there are a lot of shuffle files (if 
 wrong, please correct me):
 
 Sent from my iPhone
 
  On Aug 30, 2014, at 2:06, SK skrishna...@gmail.com wrote:
 
  Hi,
 
  I am having the same problem reported by Michael. I am trying to open 30
  files. ulimit -n  shows the limit is 1024. So I am not sure why the program
  is failing with  Too many open files error. The total size of all the 30
  files is 230 GB.
  I am running the job on a cluster with 10 nodes, each having 16 GB. The
  error appears to be happening at the distinct() stage.
 
  Here is my program. In the following code, are all the 10 nodes trying to
  open all of the 30 files or are the files distributed among the 30 nodes?
 
 val baseFile = "/mapr/mapr_dir/files_2013apr*"
 val x = sc.textFile(baseFile).map { line =>
     val fields = line.split("\t")
     (fields(11), fields(6))
 }.distinct().countByKey()
 val xrdd = sc.parallelize(x.toSeq)
 xrdd.saveAsTextFile(...)
 
  Instead of using the glob *, I guess I can try using a for loop to read the
  files one by one if that helps, but not sure if there is a more efficient
  solution.
 
  The following is the error transcript:
 
  Job aborted due to stage failure: Task 1.0:201 failed 4 times, most recent
  failure: Exception failure in TID 902 on host 192.168.13.11:
  java.io.FileNotFoundException:
  /tmp/spark-local-20140829131200-0bb7/08/shuffle_0_201_999 (Too many open
  files)
  java.io.FileOutputStream.open(Native Method)
  java.io.FileOutputStream.init(FileOutputStream.java:221)
  org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
  org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177)
  org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
  org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
  scala.collection.Iterator$class.foreach(Iterator.scala:727)
  org.apache.spark.util.collection.AppendOnlyMap$$anon$1.foreach(AppendOnlyMap.scala:159)
  org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
  org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  org.apache.spark.scheduler.Task.run(Task.scala:51)
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:744) Driver stacktrace:
 
 
 
 
 
  --
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-tp1464p13144.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 


[jira] [Created] (SPARK-3040) pick up a more proper local ip address for Utils.findLocalIpAddress method

2014-08-14 Thread Ye Xianjin (JIRA)
Ye Xianjin created SPARK-3040:
-

 Summary: pick up a more proper local ip address for 
Utils.findLocalIpAddress method
 Key: SPARK-3040
 URL: https://issues.apache.org/jira/browse/SPARK-3040
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.2
 Environment: Mac os x, a bunch of network interfaces: eth0, wlan0, 
vnic0, vnic1, tun0, lo
Reporter: Ye Xianjin
Priority: Trivial


I noticed this inconvenience when I ran spark-shell with my virtual machines on 
and a VPN service running.

There are a lot of network interfaces on my laptop(inactive devices omitted):
{quote}
lo0: inet 127.0.0.1
en1: inet 192.168.0.102
vnic0: inet 10.211.55.2 (virtual if for vm1)
vnic1: inet 10.37.129.3 (virtual if for vm2)
tun0: inet 172.16.100.191 -- 172.16.100.191 (tun device for VPN)
{quote}

In spark core, Utils.findLocalIpAddress() uses 
NetworkInterface.getNetworkInterfaces to get all active network interfaces, but 
unfortunately, this method returns network interfaces in reverse order compared 
to the ifconfig output (both use the ioctl sys call). I dug into the openJDK 6 and 
7 source code and confirmed this behavior (it only happens on unix-like systems; 
Windows deals with it and returns them in index order). So the findLocalIpAddress 
method will pick the ip address associated with tun0 rather than en1.
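
For illustration only (not part of the proposed fix), the ordering can be inspected 
with a small snippet like this:

{code}
import java.net.NetworkInterface
import scala.collection.JavaConverters._

// Print active interfaces in the order the JVM reports them; on unix-like
// systems this is the reverse of the ifconfig output shown above.
for (ni <- NetworkInterface.getNetworkInterfaces.asScala if ni.isUp) {
  val addrs = ni.getInetAddresses.asScala.map(_.getHostAddress).mkString(", ")
  println(s"${ni.getName}: $addrs")
}
{code}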




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: defaultMinPartitions in textFile

2014-07-21 Thread Ye Xianjin
Well, I think you missed this line of code in SparkContext.scala,
lines 1242-1243 (master):
 /** Default min number of partitions for Hadoop RDDs when not given by user */
  def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

So defaultMinPartitions will be 2 unless defaultParallelism is less 
than 2...
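
So if you want more input partitions than the default, pass the minimum explicitly 
(the path below is made up):

 val rdd = sc.textFile("hdfs:///tmp/input.txt", 12)  // ask for at least 12 partitions
 println(rdd.partitions.length)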


--  
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, July 22, 2014 at 10:18 AM, Wang, Jensen wrote:

 Hi,  
 I started to use spark on yarn recently and found a problem while 
 tuning my program.
   
 When SparkContext is initialized as sc and ready to read text file from hdfs, 
 the textFile(path, defaultMinPartitions) method is called.
 I traced down the second parameter in the spark source code and finally found 
 this:
conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 
 2))  in  CoarseGrainedSchedulerBackend.scala
   
 I do not specify the property “spark.default.parallelism” anywhere so the 
 getInt will return value from the larger one between totalCoreCount and 2.
   
 When I submit the application using spark-submit and specify the parameter: 
 --num-executors  2   --executor-cores 6, I suppose the totalCoreCount will be 
  
 2*6 = 12, so defaultMinPartitions will be 12.
   
 But when I print the value of defaultMinPartitions in my program, I still get 
 2 in return. How does this happen, or where did I make a mistake?
  
  
  




[jira] [Created] (SPARK-2557) createTaskScheduler should be consistent between local and local-n-failures

2014-07-17 Thread Ye Xianjin (JIRA)
Ye Xianjin created SPARK-2557:
-

 Summary: createTaskScheduler should be consistent between local 
and local-n-failures 
 Key: SPARK-2557
 URL: https://issues.apache.org/jira/browse/SPARK-2557
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Ye Xianjin
Priority: Minor


In SparkContext.createTaskScheduler, we can use {code}local[*]{code} to 
estimate the number of cores on the machine. I think we should also be able to 
use * in the local-n-failures mode.

And according to the LOCAL_N_REGEX pattern matching code, I believe 
the regular expression of LOCAL_N_REGEX is wrong. LOCAL_N_REGEX should be 
{code}
"""local\[([0-9]+|\*)\]""".r
{code} 
rather than
{code}
"""local\[([0-9\*]+)\]""".r
{code}
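
A quick sanity check of the two patterns (illustration only, not part of the issue):

{code}
val broken = """local\[([0-9\*]+)\]""".r   // also accepts mixed forms like "local[3*]"
val fixed  = """local\[([0-9]+|\*)\]""".r  // accepts "local[3]" or "local[*]" only

println(fixed.findFirstIn("local[*]"))    // Some(local[*])
println(broken.findFirstIn("local[3*]"))  // Some(local[3*]) -- not an intended master string
{code}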



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2557) createTaskScheduler should be consistent between local and local-n-failures

2014-07-17 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065001#comment-14065001
 ] 

Ye Xianjin commented on SPARK-2557:
---

I will send a pr for this.

 createTaskScheduler should be consistent between local and local-n-failures 
 

 Key: SPARK-2557
 URL: https://issues.apache.org/jira/browse/SPARK-2557
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Ye Xianjin
Priority: Minor
  Labels: starter
   Original Estimate: 2h
  Remaining Estimate: 2h

 In SparkContext.createTaskScheduler, we can use {code}local[*]{code} to 
 estimates the number of cores on the machine. I think we should also be able 
 to use * in the local-n-failures mode.
 And according to the code in the LOCAL_N_REGEX pattern matching code, I 
 believe the regular expression of LOCAL_N_REGEX is wrong. LOCAL_N_REFEX 
 should be 
 {code}
 local\[([0-9]+|\*)\].r
 {code} 
 rather than
 {code}
  local\[([0-9\*]+)\].r
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2557) createTaskScheduler should be consistent between local and local-n-failures

2014-07-17 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065029#comment-14065029
 ] 

Ye Xianjin commented on SPARK-2557:
---

Github pr: https://github.com/apache/spark/pull/1464

 createTaskScheduler should be consistent between local and local-n-failures 
 

 Key: SPARK-2557
 URL: https://issues.apache.org/jira/browse/SPARK-2557
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Ye Xianjin
Priority: Minor
  Labels: starter
   Original Estimate: 2h
  Remaining Estimate: 2h

 In SparkContext.createTaskScheduler, we can use {code}local[*]{code} to 
 estimates the number of cores on the machine. I think we should also be able 
 to use * in the local-n-failures mode.
 And according to the code in the LOCAL_N_REGEX pattern matching code, I 
 believe the regular expression of LOCAL_N_REGEX is wrong. LOCAL_N_REFEX 
 should be 
 {code}
 local\[([0-9]+|\*)\].r
 {code} 
 rather than
 {code}
  local\[([0-9\*]+)\].r
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Where to set proxy in order to run ./install-dev.sh for SparkR

2014-07-02 Thread Ye Xianjin
You can try setting your HTTP_PROXY environment variable.

export HTTP_PROXY=host:port

But I don't use maven. If the env variable doesn't work, please search google 
for maven proxy. I am sure there will be a lot of related results.

Sent from my iPhone

 On Jul 2, 2014, at 19:04, Stuti Awasthi stutiawas...@hcl.com wrote:
 
 Hi,
  
 I wanted to build SparkR from source, but I am running the script behind a proxy. 
 Where shall I set the proxy host and port in order to build the source? The issue is 
 that it is not able to download dependencies from Maven.
  
 Thanks
 Stuti Awasthi
  
 
 
 ::DISCLAIMER::
 
 The contents of this e-mail and any attachment(s) are confidential and 
 intended for the named recipient(s) only.
 E-mail transmission is not guaranteed to be secure or error-free as 
 information could be intercepted, corrupted, 
 lost, destroyed, arrive late or incomplete, or may contain viruses in 
 transmission. The e mail and its contents 
 (with or without referred errors) shall therefore not attach any liability on 
 the originator or HCL or its affiliates. 
 Views or opinions, if any, presented in this email are solely those of the 
 author and may not necessarily reflect the 
 views or opinions of HCL or its affiliates. Any form of reproduction, 
 dissemination, copying, disclosure, modification, 
 distribution and / or publication of this message without the prior written 
 consent of authorized representative of 
 HCL is strictly prohibited. If you have received this email in error please 
 delete it and notify the sender immediately. 
 Before opening any email and/or attachments, please check them for viruses 
 and other defects.
 


Re: Set comparison

2014-06-16 Thread Ye Xianjin
If you want strings with quotes, you have to escape them with '\'. That's exactly 
what you did in the modified version.
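
For example (a minimal sketch):

 val expected_res = Set("\"ID1\"", "\"ID2\"", "\"ID3\"")  // each element contains the quotes
 println(expected_res.contains("\"ID1\""))                // true
 println(expected_res.contains("ID1"))                    // false -- the element has quotes in it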

Sent from my iPhone

 On Jun 17, 2014, at 5:43, SK skrishna...@gmail.com wrote:
 
 In Line 1, I have expected_res as a set of strings with quotes. So I thought
 it would include the quotes during comparison.
 
 Anyway I modified expected_res = Set("\"ID1\"", "\"ID2\"", "\"ID3\"") and
 that seems to work.
 
 thanks.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Set-comparison-tp7696p7699.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


[jira] [Closed] (SPARK-1511) Update TestUtils.createCompiledClass() API to work with creating class file on different filesystem

2014-04-17 Thread Ye Xianjin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Xianjin closed SPARK-1511.
-

   Resolution: Fixed
Fix Version/s: 1.0.0

 Update TestUtils.createCompiledClass() API to work with creating class file 
 on different filesystem
 ---

 Key: SPARK-1511
 URL: https://issues.apache.org/jira/browse/SPARK-1511
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.8.1, 0.9.0, 1.0.0
 Environment: Mac OS X, two disks. 
Reporter: Ye Xianjin
Priority: Minor
  Labels: starter
 Fix For: 1.0.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 The createCompiledClass method uses the java File.renameTo method to rename the 
 source file to the destination file, which will fail if the source and destination 
 files are on different disks (or partitions).
 see 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failed-after-assembling-the-latest-code-from-github-td6315.html
  for more details.
 Using com.google.common.io.Files.move instead of renameTo will solve this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1527) rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1

2014-04-17 Thread Ye Xianjin (JIRA)
Ye Xianjin created SPARK-1527:
-

 Summary: rootDirs in DiskBlockManagerSuite doesn't get full path 
from rootDir0, rootDir1
 Key: SPARK-1527
 URL: https://issues.apache.org/jira/browse/SPARK-1527
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Ye Xianjin
Priority: Minor


In core/src/test/scala/org/apache/storage/DiskBlockManagerSuite.scala

  val rootDir0 = Files.createTempDir()
  rootDir0.deleteOnExit()
  val rootDir1 = Files.createTempDir()
  rootDir1.deleteOnExit()
  val rootDirs = rootDir0.getName + "," + rootDir1.getName

rootDir0 and rootDir1 are in the system's temporary directory. 
rootDir0.getName will not give the full path of the directory, only its last 
component. When this is passed to the DiskBlockManager constructor, the 
DiskBlockManager creates directories in the pwd, not in the temporary directory.

rootDir0.toString will fix this issue.
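
For illustration only (not part of the patch):

{code}
import com.google.common.io.Files

val dir = Files.createTempDir()
println(dir.getName)          // only the last path component of the temp dir
println(dir.getAbsolutePath)  // the full path under the system temp directory
{code}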



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1527) rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1

2014-04-17 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973087#comment-13973087
 ] 

Ye Xianjin commented on SPARK-1527:
---

Yes, you are right. toString() may give a relative path, since it's 
determined by the java.io.tmpdir system property; see 
https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/io/Files.java
 line 591. It's possible that the DiskBlockManager will create different 
directories than the original temp dir when java.io.tmpdir is a relative path. 

So use getAbsolutePath, since I used this method in my last PR?

But I saw toString() being called in other places! Should we do something about 
that?

 rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, 
 rootDir1
 ---

 Key: SPARK-1527
 URL: https://issues.apache.org/jira/browse/SPARK-1527
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Ye Xianjin
Priority: Minor
  Labels: starter
   Original Estimate: 24h
  Remaining Estimate: 24h

 In core/src/test/scala/org/apache/storage/DiskBlockManagerSuite.scala
   val rootDir0 = Files.createTempDir()
   rootDir0.deleteOnExit()
   val rootDir1 = Files.createTempDir()
   rootDir1.deleteOnExit()
   val rootDirs = rootDir0.getName + , + rootDir1.getName
 rootDir0 and rootDir1 are in system's temporary directory. 
 rootDir0.getName will not get the full path of the directory but the last 
 component of the directory. When passing to DiskBlockManage constructor, the 
 DiskBlockerManger creates directories in pwd not the temporary directory.
 rootDir0.toString will fix this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1527) rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1

2014-04-17 Thread Ye Xianjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973096#comment-13973096
 ] 

Ye Xianjin commented on SPARK-1527:
---

Yes, of course, sometimes we want an absolute path and sometimes we want to pass 
a relative path. It depends on the logic. 
But I think maybe we should review these usages so that we can make sure 
absolute paths and relative paths are used appropriately.

I may have time to review it after I finish another JIRA issue. If you want to 
take it over, please!

Anyway, thanks for your comments and help.


 rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, 
 rootDir1
 ---

 Key: SPARK-1527
 URL: https://issues.apache.org/jira/browse/SPARK-1527
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Ye Xianjin
Priority: Minor
  Labels: starter
   Original Estimate: 24h
  Remaining Estimate: 24h

 In core/src/test/scala/org/apache/storage/DiskBlockManagerSuite.scala
   val rootDir0 = Files.createTempDir()
   rootDir0.deleteOnExit()
   val rootDir1 = Files.createTempDir()
   rootDir1.deleteOnExit()
   val rootDirs = rootDir0.getName + , + rootDir1.getName
 rootDir0 and rootDir1 are in system's temporary directory. 
 rootDir0.getName will not get the full path of the directory but the last 
 component of the directory. When passing to DiskBlockManage constructor, the 
 DiskBlockerManger creates directories in pwd not the temporary directory.
 rootDir0.toString will fix this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1511) Update TestUtils.createCompiledClass() API to work with creating class file on different filesystem

2014-04-16 Thread Ye Xianjin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Xianjin updated SPARK-1511:
--

Affects Version/s: 0.8.1
   0.9.0

 Update TestUtils.createCompiledClass() API to work with creating class file 
 on different filesystem
 ---

 Key: SPARK-1511
 URL: https://issues.apache.org/jira/browse/SPARK-1511
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.8.1, 0.9.0, 1.0.0
 Environment: Mac OS X, two disks. 
Reporter: Ye Xianjin
Priority: Minor
  Labels: starter
   Original Estimate: 24h
  Remaining Estimate: 24h

 The createCompiledClass method uses the java File.renameTo method to rename 
 source file to destination file, which will fail if source and destination 
 files are on different disks (or partitions).
 see 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failed-after-assembling-the-latest-code-from-github-td6315.html
  for more details.
 Use com.google.common.io.Files.move instead of renameTo will solve this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Tests failed after assembling the latest code from github

2014-04-14 Thread Ye Xianjin
Thank you for your reply. 

After building the assembly jar, the repl test still failed. The error output 
is the same as I posted before. 

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, April 15, 2014 at 1:39 AM, Michael Armbrust wrote:

 I believe you may need an assembly jar to run the ReplSuite. sbt/sbt
 assembly/assembly.
 
 Michael
 
 
 On Mon, Apr 14, 2014 at 3:14 AM, Ye Xianjin advance...@gmail.com 
 (mailto:advance...@gmail.com) wrote:
 
  Hi, everyone:
  I am new to Spark development. I download spark's latest code from github.
  After running sbt/sbt assembly,
  I began running sbt/sbt test in the spark source code dir. But it failed
  running the repl module test.
  
  Here are some output details.
  
  command:
  sbt/sbt test-only org.apache.spark.repl.*
  output:
  
  [info] Loading project definition from
  /Volumes/MacintoshHD/github/spark/project/project
  [info] Loading project definition from
  /Volumes/MacintoshHD/github/spark/project
  [info] Set current project to root (in build
  file:/Volumes/MacintoshHD/github/spark/)
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for graphx/test:testOnly
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for bagel/test:testOnly
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for streaming/test:testOnly
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for mllib/test:testOnly
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for catalyst/test:testOnly
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for core/test:testOnly
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for assembly/test:testOnly
  [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
  [info] No tests to run for sql/test:testOnly
  [info] ExecutorClassLoaderSuite:
  2014-04-14 16:59:31.247 java[8393:1003] Unable to load realm info from
  SCDynamicStore
  [info] - child first *** FAILED *** (440 milliseconds)
  [info] java.lang.ClassNotFoundException: ReplFakeClass2
  [info] at java.lang.ClassLoader.findClass(ClassLoader.java:364)
  [info] at
  org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
  [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
  [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
  [info] at
  org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
  [info] at
  org.apache.spark.repl.ExecutorClassLoader$$anonfun$findClass$1.apply(ExecutorClassLoader.scala:57)
  [info] at
  org.apache.spark.repl.ExecutorClassLoader$$anonfun$findClass$1.apply(ExecutorClassLoader.scala:57)
  [info] at scala.Option.getOrElse(Option.scala:120)
  [info] at
  org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:57)
  [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
  [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
  [info] at
  org.apache.spark.repl.ExecutorClassLoaderSuite$$anonfun$1.apply$mcV$sp(ExecutorClassLoaderSuite.scala:47)
  [info] at
  org.apache.spark.repl.ExecutorClassLoaderSuite$$anonfun$1.apply(ExecutorClassLoaderSuite.scala:44)
  [info] at
  org.apache.spark.repl.ExecutorClassLoaderSuite$$anonfun$1.apply(ExecutorClassLoaderSuite.scala:44)
  [info] at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265)
  [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1974)
  [info] at
  org.apache.spark.repl.ExecutorClassLoaderSuite.withFixture(ExecutorClassLoaderSuite.scala:30)
  [info] at
  org.scalatest.FunSuite$class.invokeWithFixture$1(FunSuite.scala:1262)
  [info] at
  org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271)
  [info] at
  org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271)
  [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:198)
  [info] at org.scalatest.FunSuite$class.runTest(FunSuite.scala:1271)
  [info] at
  org.apache.spark.repl.ExecutorClassLoaderSuite.runTest(ExecutorClassLoaderSuite.scala:30)
  [info] at
  org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304)
  [info] at
  org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304)
  [info] at
  org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:260)
  [info] at
  org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:249)
  [info] at scala.collection.immutable.List.foreach(List.scala:318)
  [info] at org.scalatest.SuperEngine.org 
  (http://org.scalatest.SuperEngine.org)
  $scalatest$SuperEngine$$runTestsInBranch(Engine.scala:249)
  [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:326)
  [info] at org.scalatest.FunSuite$class.runTests(FunSuite.scala:1304)
  [info] at
  org.apache.spark.repl.ExecutorClassLoaderSuite.runTests

Re: Tests failed after assembling the latest code from github

2014-04-14 Thread Ye Xianjin
Hi, I think I have found the cause of the failing tests. 

I have two disks on my laptop. The spark project dir is on an HDD, while 
the tempdir created by com.google.common.io.Files.createTempDir is under 
/var/folders/5q/, which is on the system disk, an SSD.
The ExecutorClassLoaderSuite test uses the 
org.apache.spark.TestUtils.createCompiledClass methods.
The createCompiledClass method first generates the compiled class in the 
pwd (spark/repl), then uses renameTo to move
the file. The renameTo method fails because the dest file is on a different 
filesystem than the source file.

I modified TestUtils.scala to first copy the file to the dest and then delete the 
original file. The tests now go smoothly.
Should I file a JIRA about this problem? Then I can send a PR on Github.
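
For reference, a minimal sketch of the cross-filesystem move, using Guava's 
Files.move as suggested in SPARK-1511 (quoted earlier in this archive); the file 
names here are hypothetical:

 import java.io.File
 import com.google.common.io.Files

 val destDir    = new File(System.getProperty("java.io.tmpdir"))
 val sourceFile = new File("ReplFakeClass2.class")   // compiled into the working dir
 val destFile   = new File(destDir, sourceFile.getName)
 Files.move(sourceFile, destFile)  // falls back to copy + delete when a plain rename across disks fails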

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, April 15, 2014 at 3:43 AM, Ye Xianjin wrote:

 Well, this is very strange. 
 I looked into ExecutorClassLoaderSuite.scala and ReplSuite.scala and made 
 small changes to ExecutorClassLoaderSuite.scala (mostly printing some internal 
 variables). After that, when running the repl tests, I noticed the ReplSuite 
 was tested first and its result was ok. But the ExecutorClassLoaderSuite 
 test was weird.
 Here is the output:
 [info] ExecutorClassLoaderSuite:
 [error] Uncaught exception when running 
 org.apache.spark.repl.ExecutorClassLoaderSuite: java.lang.OutOfMemoryError: 
 PermGen space
 [error] Uncaught exception when running 
 org.apache.spark.repl.ExecutorClassLoaderSuite: java.lang.OutOfMemoryError: 
 PermGen space
 Internal error when running tests: java.lang.OutOfMemoryError: PermGen space
 Exception in thread Thread-3 java.io.EOFException
 at 
 java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2577)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
 at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1685)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1323)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
 at sbt.React.react(ForkTests.scala:116)
 at sbt.ForkTests$$anonfun$mainTestTask$1$Acceptor$2$.run(ForkTests.scala:75)
 at java.lang.Thread.run(Thread.java:695)
 
 
 I reverted my changes. The test result is the same.
 
 I touched the ReplSuite.scala file (with the touch command); the test order is 
 reversed, same as at the very beginning. And the output is also the same (the 
 result in my first post).
 
 
 -- 
 Ye Xianjin
 Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
 
 
 On Tuesday, April 15, 2014 at 3:14 AM, Aaron Davidson wrote:
 
  This may have something to do with running the tests on a Mac, as there is
  a lot of File/URI/URL stuff going on in that test which may just have
  happened to work if run on a Linux system (like Jenkins). Note that this
  suite was added relatively recently:
  https://github.com/apache/spark/pull/217
  
  
  On Mon, Apr 14, 2014 at 12:04 PM, Ye Xianjin advance...@gmail.com 
  (mailto:advance...@gmail.com) wrote:
  
   Thank you for your reply.
   
   After building the assembly jar, the repl test still failed. The error
   output is same as I post before.
   
   --
   Ye Xianjin
   Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
   
   
   On Tuesday, April 15, 2014 at 1:39 AM, Michael Armbrust wrote:
   
I believe you may need an assembly jar to run the ReplSuite. sbt/sbt
assembly/assembly.

Michael


On Mon, Apr 14, 2014 at 3:14 AM, Ye Xianjin advance...@gmail.com 
(mailto:advance...@gmail.com)(mailto:
   advance...@gmail.com (mailto:advance...@gmail.com)) wrote:

 Hi, everyone:
 I am new to Spark development. I download spark's latest code from
 


   
   github.
 After running sbt/sbt assembly,
 I began running sbt/sbt test in the spark source code dir. But it
 

   
   failed
 running the repl module test.
 
 Here are some output details.
 
 command:
 sbt/sbt test-only org.apache.spark.repl.*
 output:
 
 [info] Loading project definition from
 /Volumes/MacintoshHD/github/spark/project/project
 [info] Loading project definition from
 /Volumes/MacintoshHD/github/spark/project
 [info] Set current project to root (in build
 file:/Volumes/MacintoshHD/github/spark/)
 [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
 [info] No tests to run for graphx/test:testOnly
 [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
 [info] No tests to run for bagel/test:testOnly
 [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
 [info] No tests to run for streaming/test:testOnly
 [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
 [info] No tests to run for mllib/test:testOnly
 [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
 [info] No tests to run for catalyst/test:testOnly
 [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
 [info] No tests to run