Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
-dev +user
How are you measuring network traffic?
It's not true in general that there will be zero network traffic, since not
all executors are local to all data. Data-local reads can happen in many
cases, but not always.
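
One way to check whether data-local reads are even possible is to compare
each partition's preferred locations (the hosts holding the HDFS block
replicas) with the hosts your executors run on. A minimal sketch using the
public RDD API, reusing the path from your example:

val textRDD = sc.textFile("hdfs://master:9000/inputDir")

// Preferred locations are the datanodes holding each partition's block replicas.
// If these hosts don't overlap with your executor hosts, remote reads (and the
// network traffic you are seeing) are expected.
textRDD.partitions.take(5).foreach { p =>
  println(s"partition ${p.index} -> " + textRDD.preferredLocations(p).mkString(", "))
}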

On Mon, Oct 26, 2015 at 8:57 AM, Jinfeng Li  wrote:

> Hi, I find that loading files from HDFS can incur a huge amount of network
> traffic. The input size is 90 GB and the network traffic is about 80 GB. By my
> understanding, local blocks should be read locally, and thus no network
> communication should be needed.
>
> I use Spark 1.5.1, and the following is my code:
>
> val textRDD = sc.textFile("hdfs://master:9000/inputDir")
> textRDD.count
>
> Jeffrey
>


Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-26 Thread Patrick Wendell
I verified that the issue with build binaries being present in the source
release is fixed. I haven't done enough vetting for a full vote, but I did
verify that.

On Sun, Oct 25, 2015 at 12:07 AM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 1.5.2. The vote is open until Wed Oct 28, 2015 at 08:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.2
> [ ] -1 Do not release this package because ...
>
>
> The release fixes 51 known issues in Spark 1.5.1, listed here:
> http://s.apache.org/spark-1.5.2
>
> The tag to be voted on is v1.5.2-rc1:
> https://github.com/apache/spark/releases/tag/v1.5.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> - as version 1.5.2-rc1:
> https://repository.apache.org/content/repositories/orgapachespark-1151
> - as version 1.5.2:
> https://repository.apache.org/content/repositories/orgapachespark-1150
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-v1.5.2-rc1-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> What justifies a -1 vote for this release?
> 
> -1 vote should occur for regressions from Spark 1.5.1. Bugs already
> present in 1.5.1 will not block this release.
>
> ===
> What should happen to JIRA tickets still targeting 1.5.2?
> ===
> Please target 1.5.3 or 1.6.0.
>
>
>


Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
Hi, I find that loading files from HDFS can incur a huge amount of network
traffic. The input size is 90 GB and the network traffic is about 80 GB. By my
understanding, local blocks should be read locally, and thus no network
communication should be needed.

I use Spark 1.5.1, and the following is my code:

val textRDD = sc.textFile("hdfs://master:9000/inputDir")
textRDD.count

Jeffrey


RE: spark-sql / apache-drill / jboss-teiid

2015-10-26 Thread prajod.vettiyattil
Hi,

Though it's not the comparison you wanted, I have run a SparkSQL vs. Hive
performance comparison with one master and two worker instances. Data was
stored in HDFS. SparkSQL showed promise. I used Spark 1.4 and Hadoop 2.6.

https://hivevssparksql.wordpress.com/

The table data size used for the performance comparison ranged from 100,000 to
100 million rows. The master and slaves ran on EC2 m3.xlarge instances
(4 cores / 15 GB RAM).

In the graph you can observe the consistent response behavior of SparkSQL.
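
For context, a minimal sketch of how such a SparkSQL query over HDFS-backed
tables might be issued (this assumes a HiveContext; the table and column
names are illustrative, not the actual benchmark schema):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Run the same HiveQL against Spark SQL that was run against Hive.
val result = hiveContext.sql(
  "SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category")
result.show()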

Regards,
Prajod

From: Pranay Tonpay [mailto:pton...@gmail.com]
Sent: 25 October 2015 22:35
To: dev@spark.apache.org
Subject: spark-sql / apache-drill / jboss-teiid

Hi,
In terms of federated query, has anyone done any evaluation comparing Spark SQL,
Drill, and JBoss Teiid?
I have a very urgent requirement for creating a virtualized layer (sitting atop
several databases) and am evaluating these three as options. Any help would be
appreciated.
I know Spark SQL has the benefit that I can invoke MLlib algorithms on the data
fetched, but apart from that, are there any other considerations?
Drill does not seem to support many data sources.

Any inputs?
Thanks,
Pranay


Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Also, does it support categorical feature?

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
> you think you can implement generic GBM and have it merged as part of
> Spark codebase?
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
>  wrote:
>> Hi Spark User/Dev,
>>
>> Inspired by the success of XGBoost, I have created a Spark package for
>> gradient boosting tree with 2nd order approximation of arbitrary
>> user-defined loss functions.
>>
>> https://github.com/rotationsymmetry/SparkXGBoost
>>
>> Currently linear (normal) regression, binary classification, Poisson
>> regression are supported. You can extend with other loss function as
>> well.
>>
>> L1, L2, bagging, feature sub-sampling are also employed to avoid overfitting.
>>
>> Thank you for testing. I am looking forward to your comments and
>> suggestions. Bugs or improvements can be reported through GitHub.
>>
>> Many thanks!
>>
>> Meihua
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Interesting. For feature sub-sampling, is it per-node or per-tree? Do
you think you can implement generic GBM and have it merged as part of
Spark codebase?

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
 wrote:
> Hi Spark User/Dev,
>
> Inspired by the success of XGBoost, I have created a Spark package for
> gradient boosting tree with 2nd order approximation of arbitrary
> user-defined loss functions.
>
> https://github.com/rotationsymmetry/SparkXGBoost
>
> Currently linear (normal) regression, binary classification, Poisson
> regression are supported. You can extend with other loss function as
> well.
>
> L1, L2, bagging, feature sub-sampling are also employed to avoid overfitting.
>
> Thank you for testing. I am looking forward to your comments and
> suggestions. Bugs or improvements can be reported through GitHub.
>
> Many thanks!
>
> Meihua
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark Implementation of XGBoost

2015-10-26 Thread YiZhi Liu
There's an XGBoost exploration JIRA, SPARK-8547. Could it be a good starting point?

2015-10-27 7:07 GMT+08:00 DB Tsai :
> Also, does it support categorical feature?
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
>> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
>> you think you can implement generic GBM and have it merged as part of
>> Spark codebase?
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>>
>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
>>  wrote:
>>> Hi Spark User/Dev,
>>>
>>> Inspired by the success of XGBoost, I have created a Spark package for
>>> gradient boosting tree with 2nd order approximation of arbitrary
>>> user-defined loss functions.
>>>
>>> https://github.com/rotationsymmetry/SparkXGBoost
>>>
>>> Currently linear (normal) regression, binary classification, Poisson
>>> regression are supported. You can extend with other loss function as
>>> well.
>>>
>>> L1, L2, bagging, feature sub-sampling are also employed to avoid 
>>> overfitting.
>>>
>>> Thank you for testing. I am looking forward to your comments and
>>> suggestions. Bugs or improvements can be reported through GitHub.
>>>
>>> Many thanks!
>>>
>>> Meihua
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



-- 
Yizhi Liu
Senior Software Engineer / Data Mining
www.mvad.com, Shanghai, China

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-26 Thread Krishna Sankar
Guys,
   sc.version returns 1.5.1 in both Python and Scala. Is anyone getting the
same result? Perhaps I am doing something wrong.
Cheers


On Sun, Oct 25, 2015 at 12:07 AM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 1.5.2. The vote is open until Wed Oct 28, 2015 at 08:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.2
> [ ] -1 Do not release this package because ...
>
>
> The release fixes 51 known issues in Spark 1.5.1, listed here:
> http://s.apache.org/spark-1.5.2
>
> The tag to be voted on is v1.5.2-rc1:
> https://github.com/apache/spark/releases/tag/v1.5.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> - as version 1.5.2-rc1:
> https://repository.apache.org/content/repositories/orgapachespark-1151
> - as version 1.5.2:
> https://repository.apache.org/content/repositories/orgapachespark-1150
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-v1.5.2-rc1-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> What justifies a -1 vote for this release?
> 
> -1 vote should occur for regressions from Spark 1.5.1. Bugs already
> present in 1.5.1 will not block this release.
>
> ===
> What should happen to JIRA tickets still targeting 1.5.2?
> ===
> Please target 1.5.3 or 1.6.0.
>
>
>


Re: Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi YiZhi,

Thank you for mentioning the JIRA. I will add a note to it.

Meihua

On Mon, Oct 26, 2015 at 6:16 PM, YiZhi Liu  wrote:
> There's an xgboost exploration jira SPARK-8547. Can it be a good start?
>
> 2015-10-27 7:07 GMT+08:00 DB Tsai :
>> Also, does it support categorical feature?
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>>
>> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
>>> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
>>> you think you can implement generic GBM and have it merged as part of
>>> Spark codebase?
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> --
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>>
>>>
>>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
>>>  wrote:
 Hi Spark User/Dev,

 Inspired by the success of XGBoost, I have created a Spark package for
 gradient boosting tree with 2nd order approximation of arbitrary
 user-defined loss functions.

 https://github.com/rotationsymmetry/SparkXGBoost

 Currently linear (normal) regression, binary classification, Poisson
 regression are supported. You can extend with other loss function as
 well.

 L1, L2, bagging, feature sub-sampling are also employed to avoid 
 overfitting.

 Thank you for testing. I am looking forward to your comments and
 suggestions. Bugs or improvements can be reported through GitHub.

 Many thanks!

 Meihua

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>
>
> --
> Yizhi Liu
> Senior Software Engineer / Data Mining
> www.mvad.com, Shanghai, China

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Re: repartitionAndSortWithinPartitions task shuffle phase is very slow

2015-10-26 Thread 周千昊
I have replaced the default Java serialization with Kryo.
It does reduce the shuffle size, and the performance has improved;
however, the shuffle speed remains unchanged.
I am quite new to Spark. Does anyone have an idea of which direction
I should look in to find the root cause?
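
For reference, the Kryo setup is just the standard configuration (a minimal
sketch; the registered classes below are only examples, not necessarily the
types that dominate our shuffle):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering the classes written during the shuffle avoids storing full
  // class names with each serialized object.
  .registerKryoClasses(Array(classOf[Array[Byte]], classOf[String]))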

周千昊 wrote on Friday, Oct 23, 2015 at 5:50 PM:

> We have not tried that yet; however, both the MR and Spark implementations
> were tested with the same number of partitions on the same cluster.
>
> 250635...@qq.com <250635...@qq.com> wrote on Friday, Oct 23, 2015 at 5:21 PM:
>
>> Hi,
>>
>> I'm not an expert on this kind of implementation, but looking at the
>> performance result: are the map-side partitions sized appropriately for
>> the different datasets? Have you tried increasing the number of
>> partitions?
>>
>>
>>
>>
>>
>> 250635...@qq.com
>>
>> From: Li Yang
>> Date: 2015-10-23 16:17
>> To: dev
>> CC: Reynold Xin; dev@spark.apache.org
>> Subject: Re: repartitionAndSortWithinPartitions task shuffle phase is
>> very slow
>> Any advice on how to tune the repartitionAndSortWithinPartitions stage?
>> Any particular metrics or parameters to look into? Basically Spark and MR
>> shuffle the same amount of data, because we essentially copied the MR
>> implementation into Spark.
>>
>> Let us know if more info is needed.
>>
>> On Fri, Oct 23, 2015 at 10:24 AM, 周千昊  wrote:
>>
>> > +kylin dev list
>> >
>> > 周千昊 wrote on Friday, Oct 23, 2015 at 10:20 AM:
>> >
>> > > Hi, Reynold
>> > >   We use glom() because it is easy to adapt the calculation logic
>> > > already implemented in MR. And to be clear, we are still in a POC.
>> > >   Since the results show there is almost no difference between this
>> > > glom stage and the MR mapper, using glom here might not be the issue.
>> > >   I was monitoring the network traffic when the repartition
>> > > happens, and it showed that the traffic peaks at about 200-300 MB/s
>> > > while staying at about 3-4 MB/s for a long time. Have you guys got
>> > > any idea about it?
>> > >
>> > > Reynold Xin wrote on Friday, Oct 23, 2015 at 2:43 AM:
>> > >
>> > >> Why do you do a glom? It seems unnecessarily expensive to materialize
>> > >> each partition in memory.
>> > >>
>> > >>
>> > >> On Thu, Oct 22, 2015 at 2:02 AM, 周千昊  wrote:
>> > >>
>> > >>> Hi, spark community
>> > >>>   I have an application which I try to migrate from MR to Spark.
>> > >>>   It will do some calculations from Hive and output to hfile
>> which
>> > >>> will be bulk load to HBase Table, details as follow:
>> > >>>
>> > >>>  Rdd input = getSourceInputFromHive()
>> > >>>  Rdd> mapSideResult =
>> > >>> input.glom().mapPartitions(/*some calculation, equivalent to MR
>> mapper
>> > >>> */)
>> > >>>  // PS: the result in each partition has already been sorted
>> > >>> according to the lexicographical order during the calculation
>> > >>>  mapSideResult.repartitionAndSortWithinPartitions(/*partition with
>> > >>> byte[][] which is HTable split key, equivalent to MR shuffle
>> > */).map(/*transform
>> > >>> Tuple2 to Tuple2> > KeyValue>/*equivalent
>> > >>> to MR reducer without output*/).saveAsNewAPIHadoopFile(/*write to
>> > >>> hfile*/)
>> > >>>
>> > >>>   This all works fine on a small dataset, and spark outruns MR
>> by
>> > >>> about 10%. However when I apply it on a dataset of 150 million
>> > records, MR
>> > >>> is about 100% faster than spark.(*MR 25min spark 50min*)
>> > >>>After exploring into the application UI, it shows that in the
>> > >>> repartitionAndSortWithinPartitions stage is very slow, and in the
>> > shuffle
>> > >>> phase a 6GB size shuffle cost about 18min which is quite
>> unreasonable
>> > >>>*Can anyone help with this issue and give me some advice on
>> > >>> this? It's not iterative processing; however, I believe Spark
>> > >>> should be at least as fast.*
>> > >>>
>> > >>>   Here are the cluster info:
>> > >>>   vm: 8 nodes * (128G mem + 64 core)
>> > >>>   hadoop cluster: hdp 2.2.6
>> > >>>   spark running mode: yarn-client
>> > >>>   spark version: 1.5.1
>> > >>>
>> > >>>
>> > >>
>> >
>>
> --
Best Regards
ZhouQianhao


Re: Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi DB Tsai,

Thank you very much for your interest and comment.

1) Feature sub-sampling is per-node, like random forest.

2) The current code heavily exploits the tree structure to speed up
the learning (such as processing multiple learning nodes in one pass of
the training data), so a generic GBM is likely to be a different
codebase. Do you have any good references on efficient GBM? I am more
than happy to look into that.

3) The algorithm accepts training data as a DataFrame with the
featureCol indexed by VectorIndexer. You can specify which variables are
categorical in the VectorIndexer. Please note that currently all
categorical variables are treated as ordered. If you want some
categorical variables to be treated as unordered, you can pass the data
through OneHotEncoder before the VectorIndexer. I do have a plan to
handle unordered categorical variables using the approach from RF in
Spark ML (please see the roadmap in the README.md).
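
For anyone following along, a minimal sketch of that preparation step using
the standard Spark ML 1.5 feature transformers (the DataFrame `training` and
its column names below are illustrative, not taken from the SparkXGBoost docs):

import org.apache.spark.ml.feature.VectorIndexer

// Mark low-cardinality features in the assembled "features" vector as categorical.
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10)  // features with <= 10 distinct values become categorical

val indexed = indexer.fit(training).transform(training)
// A categorical column that should be treated as unordered could be passed
// through OneHotEncoder before being assembled into "features", as noted above.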

Thanks,

Meihua



On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
> you think you can implement generic GBM and have it merged as part of
> Spark codebase?
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
>  wrote:
>> Hi Spark User/Dev,
>>
>> Inspired by the success of XGBoost, I have created a Spark package for
>> gradient boosting tree with 2nd order approximation of arbitrary
>> user-defined loss functions.
>>
>> https://github.com/rotationsymmetry/SparkXGBoost
>>
>> Currently linear (normal) regression, binary classification, Poisson
>> regression are supported. You can extend with other loss function as
>> well.
>>
>> L1, L2, bagging, feature sub-sampling are also employed to avoid overfitting.
>>
>> Thank you for testing. I am looking forward to your comments and
>> suggestions. Bugs or improvements can be reported through GitHub.
>>
>> Many thanks!
>>
>> Meihua
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi Spark User/Dev,

Inspired by the success of XGBoost, I have created a Spark package for
gradient boosted trees with a second-order approximation of arbitrary
user-defined loss functions.

https://github.com/rotationsymmetry/SparkXGBoost

Currently linear (normal) regression, binary classification, and Poisson
regression are supported. You can extend it with other loss functions as
well.
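
(For readers unfamiliar with the XGBoost-style formulation: the second-order
approximation generally means expanding the loss around the current prediction
using its gradient g_i and Hessian h_i, so that each boosting step roughly
minimizes sum_i [ g_i * f(x_i) + (1/2) * h_i * f(x_i)^2 ] + Omega(f), where f
is the new tree and Omega is a regularization term. SparkXGBoost's exact
formulation may differ; see the README linked above.)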

L1 and L2 regularization, bagging, and feature sub-sampling are also employed to avoid overfitting.

Thank you for testing. I am looking forward to your comments and
suggestions. Bugs or improvements can be reported through GitHub.

Many thanks!

Meihua

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org