Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Reynold Xin
Please go for it!

On Friday, June 17, 2016, Pedro Rodriguez  wrote:

> I would be open to working on Dataset documentation if no one else is
> already working on it. Thoughts?
>
> On Fri, Jun 17, 2016 at 11:44 PM, Cheng Lian wrote:
>
>> As mentioned in the PR description, this is just an initial PR to bring
>> existing contents up to date, so that people can add more contents
>> incrementally.
>>
>> We should definitely cover more about Dataset.
>>
>>
>> Cheng
>>
>> On 6/17/16 10:28 PM, Pedro Rodriguez wrote:
>>
>> The updates look great!
>>
>> Looks like many places are updated to the new APIs, but there still isn't
>> a section for working with Datasets (most of the docs work with
>> Dataframes). Are you planning on adding more? I am thinking something that
>> would address common questions like the one I posted on the user email list
>> earlier today.
>>
>> Should I take discussion to your PR?
>>
>> Pedro
>>
>> On Fri, Jun 17, 2016 at 11:12 PM, Cheng Lian wrote:
>>
>>> Hey Pedro,
>>>
>>> SQL programming guide is being updated. Here's the PR, but not merged
>>> yet: https://github.com/apache/spark/pull/13592
>>>
>>> Cheng
>>> On 6/17/16 9:13 PM, Pedro Rodriguez wrote:
>>>
>>> Hi All,
>>>
>>> At my workplace we are starting to use Datasets in 1.6.1 and even more
>>> with Spark 2.0 in place of DataFrames. I looked at the 1.6.1 documentation
>>> and then the 2.0 documentation, and it looks like not much time has been
>>> spent writing a Dataset guide/tutorial.
>>>
>>> Preview Docs:
>>> https://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/sql-programming-guide.html#creating-datasets
>>> Spark master docs:
>>> https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md
>>>
>>> I would like to spend the time to contribute an improvement to those
>>> docs with more in-depth examples of creating and using Datasets (e.g.
>>> using $ to select columns). Is this of value, and if so, what should my
>>> next step be to get this going (create a JIRA, etc.)?
>>>
>>> --
>>> Pedro Rodriguez
>>> PhD Student in Distributed Machine Learning | CU Boulder
>>> R&D Data Science Intern at Oracle Data Cloud
>>> UC Berkeley AMPLab Alumni
>>>
>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>> Github: github.com/EntilZha | LinkedIn:
>>> https://www.linkedin.com/in/pedrorodriguezscience
>>>
>>>
>>>
>>
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Pedro Rodriguez
I would be open to working on Dataset documentation if no one else is
already working on it. Thoughts?

On Fri, Jun 17, 2016 at 11:44 PM, Cheng Lian  wrote:

> As mentioned in the PR description, this is just an initial PR to bring
> existing contents up to date, so that people can add more contents
> incrementally.
>
> We should definitely cover more about Dataset.
>
>
> Cheng
>
> On 6/17/16 10:28 PM, Pedro Rodriguez wrote:
>
> The updates look great!
>
> Looks like many places are updated to the new APIs, but there still isn't
> a section for working with Datasets (most of the docs work with
> Dataframes). Are you planning on adding more? I am thinking something that
> would address common questions like the one I posted on the user email list
> earlier today.
>
> Should I take discussion to your PR?
>
> Pedro
>
> On Fri, Jun 17, 2016 at 11:12 PM, Cheng Lian 
> wrote:
>
>> Hey Pedro,
>>
>> SQL programming guide is being updated. Here's the PR, but not merged
>> yet: https://github.com/apache/spark/pull/13592
>>
>> Cheng
>> On 6/17/16 9:13 PM, Pedro Rodriguez wrote:
>>
>> Hi All,
>>
>> At my workplace we are starting to use Datasets in 1.6.1 and even more
>> with Spark 2.0 in place of DataFrames. I looked at the 1.6.1 documentation
>> and then the 2.0 documentation, and it looks like not much time has been
>> spent writing a Dataset guide/tutorial.
>>
>> Preview Docs:
>> https://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/sql-programming-guide.html#creating-datasets
>> Spark master docs:
>> https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md
>>
>> I would like to spend the time to contribute an improvement to those docs
>> with more in-depth examples of creating and using Datasets (e.g. using $ to
>> select columns). Is this of value, and if so, what should my next step be
>> to get this going (create a JIRA, etc.)?
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> R&D Data Science Intern at Oracle Data Cloud
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io |
>> 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> 
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>
>


-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Cheng Lian
As mentioned in the PR description, this is just an initial PR to bring 
existing contents up to date, so that people can add more contents 
incrementally.


We should definitely cover more about Dataset.


Cheng


On 6/17/16 10:28 PM, Pedro Rodriguez wrote:

The updates look great!

Looks like many places are updated to the new APIs, but there still 
isn't a section for working with Datasets (most of the docs work with 
Dataframes). Are you planning on adding more? I am thinking something 
that would address common questions like the one I posted on the user 
email list earlier today.


Should I take discussion to your PR?

Pedro

On Fri, Jun 17, 2016 at 11:12 PM, Cheng Lian wrote:


Hey Pedro,

SQL programming guide is being updated. Here's the PR, but not
merged yet: https://github.com/apache/spark/pull/13592

Cheng

On 6/17/16 9:13 PM, Pedro Rodriguez wrote:

Hi All,

At my workplace we are starting to use Datasets in 1.6.1 and even
more with Spark 2.0 in place of DataFrames. I looked at the 1.6.1
documentation and then the 2.0 documentation, and it looks like not
much time has been spent writing a Dataset guide/tutorial.

Preview Docs:
https://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/sql-programming-guide.html#creating-datasets

Spark master docs:
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md


I would like to spend the time to contribute an improvement to
those docs with more in-depth examples of creating and using
Datasets (e.g. using $ to select columns). Is this of value, and if
so, what should my next step be to get this going (create a JIRA, etc.)?

-- 
Pedro Rodriguez

PhD Student in Distributed Machine Learning | CU Boulder
R&D Data Science Intern at Oracle Data Cloud
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience






--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience






Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Pedro Rodriguez
The updates look great!

Looks like many places are updated to the new APIs, but there still isn't a
section for working with Datasets (most of the docs work with Dataframes).
Are you planning on adding more? I am thinking something that would address
common questions like the one I posted on the user email list earlier today.

Should I take discussion to your PR?

Pedro

On Fri, Jun 17, 2016 at 11:12 PM, Cheng Lian  wrote:

> Hey Pedro,
>
> SQL programming guide is being updated. Here's the PR, but not merged yet:
> https://github.com/apache/spark/pull/13592
>
> Cheng
> On 6/17/16 9:13 PM, Pedro Rodriguez wrote:
>
> Hi All,
>
> At my workplace we are starting to use Datasets in 1.6.1 and even more
> with Spark 2.0 in place of DataFrames. I looked at the 1.6.1 documentation
> and then the 2.0 documentation, and it looks like not much time has been
> spent writing a Dataset guide/tutorial.
>
> Preview Docs:
> https://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/sql-programming-guide.html#creating-datasets
> Spark master docs:
> https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md
>
> I would like to spend the time to contribute an improvement to those docs
> with more in-depth examples of creating and using Datasets (e.g. using $ to
> select columns). Is this of value, and if so, what should my next step be
> to get this going (create a JIRA, etc.)?
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> R&D Data Science Intern at Oracle Data Cloud
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>
>


-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Cheng Lian

Hey Pedro,

SQL programming guide is being updated. Here's the PR, but not merged 
yet: https://github.com/apache/spark/pull/13592


Cheng

On 6/17/16 9:13 PM, Pedro Rodriguez wrote:

Hi All,

At my workplace we are starting to use Datasets in 1.6.1 and even more
with Spark 2.0 in place of DataFrames. I looked at the 1.6.1
documentation and then the 2.0 documentation, and it looks like not much
time has been spent writing a Dataset guide/tutorial.


Preview Docs: 
https://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/sql-programming-guide.html#creating-datasets 

Spark master docs: 
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md


I would like to spend the time to contribute an improvement to those
docs with more in-depth examples of creating and using Datasets (e.g.
using $ to select columns). Is this of value, and if so, what should my
next step be to get this going (create a JIRA, etc.)?


--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
R&D Data Science Intern at Oracle Data Cloud
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience






Question about equality of o.a.s.sql.Row

2016-06-17 Thread Kazuaki Ishizaki
Dear all,

I have three questions about equality of org.apache.spark.sql.Row.

(1) If a Row has a complex type (e.g. Array), is the following behavior
expected?
If two Rows have the same array instance, Row.equals returns true in the
second assert. If two Rows have different array instances (a1 and a2) with
the same array elements, Row.equals returns false in the third assert.

import org.apache.spark.sql.Row

val a1 = Array(3, 4)
val a2 = Array(3, 4)
val r1 = Row(a1)
val r2 = Row(a2)
assert(a1.sameElements(a2))  // SUCCESS
assert(r1.equals(Row(a1)))   // SUCCESS: both Rows wrap the same array instance
assert(r1.equals(r2))        // FAILURE: equal contents, different array instances

This is because two objects are compared by "o1 != o2" instead of 
"o1.equals(o2)" at 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala#L408

(2) If (1) is expected, where is this behavior described or defined? I
cannot find a description in the API documentation.
https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/Row.html
https://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/api/scala/index.html#org.apache.spark.sql.Row

(3) If (1) is expected, is there a recommended way to write equality checks
for two Rows that contain an Array or other complex types (e.g. Map)?
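
For example, would a content-based comparison along these lines be a
reasonable approach? (Just a sketch with a hypothetical helper, not an
existing API, and it only handles a single level of arrays.)

def rowsEqualByContent(r1: Row, r2: Row): Boolean =
  r1.length == r2.length && r1.toSeq.zip(r2.toSeq).forall {
    case (x: Array[_], y: Array[_]) => x.toSeq == y.toSeq  // compare array contents
    case (x, y)                     => x == y              // fall back to normal equality
  }

assert(rowsEqualByContent(r1, r2))  // true: same contents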

Best Regards,
Kazuaki Ishizaki, @kiszk



Re: Skew data

2016-06-17 Thread Pedro Rodriguez
I am going to take a guess that this means that your partitions within an
RDD are not balanced (one or more partitions are much larger than the
rest). This would mean a single core would need to do much more work than
the rest leading to poor performance. In general, the way to fix this is to
spread data across partitions evenly. In most cases calling repartition is
enough to solve the problem. If you have a special case, you might need to
create your own custom partitioner.
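
A minimal sketch of both approaches (assuming a SparkContext named sc, as in
spark-shell; the partition count of 200 and the salt range of 20 are arbitrary):

import org.apache.spark.HashPartitioner

val rdd = sc.parallelize(0 until 1000000).map(i => (i % 10, i))

// Usually enough: shuffle into a fixed number of roughly equal partitions.
val balanced = rdd.repartition(200)

// If a few keys dominate, salting the key before hash partitioning spreads
// each hot key across several partitions.
val salted = rdd
  .map { case (k, v) => ((k, scala.util.Random.nextInt(20)), v) }
  .partitionBy(new HashPartitioner(200))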

Pedro

On Thu, Jun 16, 2016 at 6:55 PM, Selvam Raman  wrote:

> Hi,
>
> What is skewed data?
>
> I read that if the data is skewed while joining, it will take a long time
> to finish the job (99 percent of tasks finish in seconds while the remaining
> 1 percent take minutes to hours).
>
> How to handle skewed data in spark.
>
> Thanks,
> Selvam R
> +91-97877-87724
>



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Spark 2.0 Dataset Documentation

2016-06-17 Thread Pedro Rodriguez
Hi All,

At my workplace we are starting to use Datasets in 1.6.1 and even more with
Spark 2.0 in place of DataFrames. I looked at the 1.6.1 documentation and
then the 2.0 documentation, and it looks like not much time has been spent
writing a Dataset guide/tutorial.

Preview Docs:
https://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/sql-programming-guide.html#creating-datasets
Spark master docs:
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md

I would like to spend the time to contribute an improvement to those docs
with more in-depth examples of creating and using Datasets (e.g. using $ to
select columns). Is this of value, and if so, what should my next step be to
get this going (create a JIRA, etc.)?
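
As a rough sketch of the kind of example I have in mind (Spark 2.0 API,
assuming the spark-shell SparkSession named spark and its implicits):

case class Person(name: String, age: Long)

import spark.implicits._
val ds = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()

ds.filter($"age" > 26).select($"name").show()  // untyped, Column-based, uses $
ds.filter(_.age > 26).map(_.name).show()       // typed, lambda-based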

-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
R&D Data Science Intern at Oracle Data Cloud
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-17 Thread Jonathan Kelly
I'm trying to debug a problem in Spark 2.0.0-SNAPSHOT
(commit bdf5fe4143e5a1a393d97d0030e76d35791ee248) where Spark's
log4j.properties is not getting picked up in the executor classpath (and
driver classpath for yarn-cluster mode), so Hadoop's log4j.properties file
is taking precedence in the YARN containers.

Spark's log4j.properties file is correctly being bundled into the
__spark_conf__.zip file and getting added to the DistributedCache, but it
is not in the classpath of the executor, as evidenced by the following
command, which I ran in spark-shell:

scala> sc.parallelize(Seq(1)).map(_ =>
getClass().getResource("/log4j.properties")).first
res3: java.net.URL = file:/etc/hadoop/conf.empty/log4j.properties

I then ran the following in spark-shell to verify the classpath of the
executors:

scala> sc.parallelize(Seq(1)).map(_ =>
System.getProperty("java.class.path")).flatMap(_.split(':')).filter(e =>
!e.endsWith(".jar") && !e.endsWith("*")).collect.foreach(println)
...
/mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
/mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03/__spark_conf__
/etc/hadoop/conf
...

So the JVM has this nonexistent __spark_conf__ directory in the classpath
when it should really be __spark_conf__.zip (which is actually a symlink to
a directory, despite the .zip filename).

% sudo ls -l
/mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
total 20
-rw-r--r-- 1 yarn yarn   88 Jun 18 01:26 container_tokens
-rwx-- 1 yarn yarn  594 Jun 18 01:26
default_container_executor_session.sh
-rwx-- 1 yarn yarn  648 Jun 18 01:26 default_container_executor.sh
-rwx-- 1 yarn yarn 4419 Jun 18 01:26 launch_container.sh
lrwxrwxrwx 1 yarn yarn   59 Jun 18 01:26 __spark_conf__.zip ->
/mnt1/yarn/usercache/hadoop/filecache/17/__spark_conf__.zip
lrwxrwxrwx 1 yarn yarn   77 Jun 18 01:26 __spark_libs__ ->
/mnt/yarn/usercache/hadoop/filecache/16/__spark_libs__4490748779530764463.zip
drwx--x--- 2 yarn yarn   46 Jun 18 01:26 tmp

Does anybody know why this is happening? Is this a bug in Spark, or is it
the JVM doing this (possibly because the extension is .zip)?

Thanks,
Jonathan


Re: Hello

2016-06-17 Thread Michael Armbrust
Another good signal is the "target version" (which by convention is only
set by committers).  When I set this for the upcoming version, it means I
think it's important enough that I will prioritize reviewing a patch for it.

On Fri, Jun 17, 2016 at 3:22 PM, Pedro Rodriguez 
wrote:

> What is the best way to determine what the library maintainers believe is
> important work to be done?
>
> I have looked through JIRA and it's unclear which priority items one
> could work on. I am guessing this is in part because things are a little
> hectic with final work for 2.0, but it would be helpful to know what to
> look for, or whether it's better to ask library maintainers directly.
>
> Thanks,
> Pedro Rodriguez
>
> On Fri, Jun 17, 2016 at 10:46 AM, Xinh Huynh  wrote:
>
>> Here are some guidelines about contributing to Spark:
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>>
>> There is also a section specific to MLlib:
>>
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
>>
>> -Xinh
>>
>> On Thu, Jun 16, 2016 at 9:30 AM,  wrote:
>>
>>> Dear All,
>>>
>>>
>>> Looking for guidance.
>>>
>>> I am interested in contributing to Spark MLlib. Could you please
>>> take a few minutes to guide me as to what you would consider an ideal path
>>> / skill set an individual should possess?
>>>
>>> I know R / Python / Java / C and C++
>>>
>>> I have a firm understanding of algorithms and Machine learning. I do
>>> know spark at a "workable knowledge level".
>>>
>>> Where should I start and what should I try to do first  ( spark internal
>>> level ) and then pick up items on JIRA OR new specifications on Spark.
>>>
>>> R has a great set of packages - would it be difficult to migrate them to
>>> Spark R set. I could try it with your support or if it's desired.
>>>
>>>
>>> I wouldn't mind doing testing of some defects etc as an initial learning
>>> curve if that would assist the community.
>>>
>>> Please, guide.
>>>
>>> Regards,
>>> Harmeet
>>>
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


Re: Hello

2016-06-17 Thread Ted Yu
You can use a JIRA filter to find JIRAs of the component(s) you're
interested in.
Then sort by Priority.

Maybe comment on the JIRA if you want to work on it.

On Fri, Jun 17, 2016 at 3:22 PM, Pedro Rodriguez 
wrote:

> What is the best way to determine what the library maintainers believe is
> important work to be done?
>
> I have looked through JIRA and it's unclear which priority items one
> could work on. I am guessing this is in part because things are a little
> hectic with final work for 2.0, but it would be helpful to know what to
> look for, or whether it's better to ask library maintainers directly.
>
> Thanks,
> Pedro Rodriguez
>
> On Fri, Jun 17, 2016 at 10:46 AM, Xinh Huynh  wrote:
>
>> Here are some guidelines about contributing to Spark:
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>>
>> There is also a section specific to MLlib:
>>
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
>>
>> -Xinh
>>
>> On Thu, Jun 16, 2016 at 9:30 AM,  wrote:
>>
>>> Dear All,
>>>
>>>
>>> Looking for guidance.
>>>
>>> I am interested in contributing to Spark MLlib. Could you please
>>> take a few minutes to guide me as to what you would consider an ideal path
>>> / skill set an individual should possess?
>>>
>>> I know R / Python / Java / C and C++
>>>
>>> I have a firm understanding of algorithms and Machine learning. I do
>>> know spark at a "workable knowledge level".
>>>
>>> Where should I start and what should I try to do first  ( spark internal
>>> level ) and then pick up items on JIRA OR new specifications on Spark.
>>>
>>> R has a great set of packages - would it be difficult to migrate them to
>>> Spark R set. I could try it with your support or if it's desired.
>>>
>>>
>>> I wouldn't mind doing testing of some defects etc as an initial learning
>>> curve if that would assist the community.
>>>
>>> Please, guide.
>>>
>>> Regards,
>>> Harmeet
>>>
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


Re: Hello

2016-06-17 Thread Pedro Rodriguez
What is the best way to determine what the library maintainers believe is
important work to be done?

I have looked through JIRA and it's unclear which priority items one
could work on. I am guessing this is in part because things are a little
hectic with final work for 2.0, but it would be helpful to know what to
look for, or whether it's better to ask library maintainers directly.

Thanks,
Pedro Rodriguez

On Fri, Jun 17, 2016 at 10:46 AM, Xinh Huynh  wrote:

> Here are some guidelines about contributing to Spark:
>
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>
> There is also a section specific to MLlib:
>
>
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
>
> -Xinh
>
> On Thu, Jun 16, 2016 at 9:30 AM,  wrote:
>
>> Dear All,
>>
>>
>> Looking for guidance.
>>
>> I am interested in contributing to Spark MLlib. Could you please take
>> a few minutes to guide me as to what you would consider an ideal path /
>> skill set an individual should possess?
>>
>> I know R / Python / Java / C and C++
>>
>> I have a firm understanding of algorithms and Machine learning. I do know
>> spark at a "workable knowledge level".
>>
>> Where should I start and what should I try to do first  ( spark internal
>> level ) and then pick up items on JIRA OR new specifications on Spark.
>>
>> R has a great set of packages - would it be difficult to migrate them to
>> Spark R set. I could try it with your support or if it's desired.
>>
>>
>> I wouldn't mind doing testing of some defects etc as an initial learning
>> curve if that would assist the community.
>>
>> Please, guide.
>>
>> Regards,
>> Harmeet
>>
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Re: [VOTE] Release Apache Spark 1.6.2 (RC1)

2016-06-17 Thread Ted Yu
Docker Integration Tests failed on Linux:

http://pastebin.com/Ut51aRV3

Here was the command I used:

mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr
-Dhadoop.version=2.7.0 package

Has anyone seen similar error ?

Thanks

On Thu, Jun 16, 2016 at 9:49 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.2!
>
> The vote is open until Sunday, June 19, 2016 at 22:00 PDT and passes if a
> majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v1.6.2-rc1
> (4168d9c94a9564f6b3e62f5d669acde13a7c7cf6)
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-1.6.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1184
>
> The documentation corresponding to this release can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-1.6.2-rc1-docs/
>
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.6.1.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This is a maintenance release in the 1.6.x series.  Bugs already present
> in 1.6.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
>


Re: [VOTE] Release Apache Spark 1.6.2 (RC1)

2016-06-17 Thread Marcelo Vanzin
-1 (non-binding)

SPARK-16017 shows a severe perf regression in YARN compared to 1.6.1.

On Thu, Jun 16, 2016 at 9:49 PM, Reynold Xin  wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.2!
>
> The vote is open until Sunday, June 19, 2016 at 22:00 PDT and passes if a
> majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v1.6.2-rc1
> (4168d9c94a9564f6b3e62f5d669acde13a7c7cf6)
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-1.6.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1184
>
> The documentation corresponding to this release can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-1.6.2-rc1-docs/
>
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.6.1.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This is a maintenance release in the 1.6.x series.  Bugs already present in
> 1.6.1, missing features, or bugs related to new features will not
> necessarily block this release.
>



-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.6.2 (RC1)

2016-06-17 Thread Jonathan Kelly
+1 (non-binding)

On Thu, Jun 16, 2016 at 9:49 PM Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.2!
>
> The vote is open until Sunday, June 19, 2016 at 22:00 PDT and passes if a
> majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v1.6.2-rc1
> (4168d9c94a9564f6b3e62f5d669acde13a7c7cf6)
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-1.6.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1184
>
> The documentation corresponding to this release can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-1.6.2-rc1-docs/
>
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.6.1.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This is a maintenance release in the 1.6.x series.  Bugs already present
> in 1.6.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
>


Re: Hello

2016-06-17 Thread Xinh Huynh
Here are some guidelines about contributing to Spark:

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

There is also a section specific to MLlib:

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines

-Xinh

On Thu, Jun 16, 2016 at 9:30 AM,  wrote:

> Dear All,
>
>
> Looking for guidance.
>
> I am interested in contributing to Spark MLlib. Could you please take
> a few minutes to guide me as to what you would consider an ideal path /
> skill set an individual should possess?
>
> I know R / Python / Java / C and C++
>
> I have a firm understanding of algorithms and Machine learning. I do know
> spark at a "workable knowledge level".
>
> Where should I start and what should I try to do first  ( spark internal
> level ) and then pick up items on JIRA OR new specifications on Spark.
>
> R has a great set of packages - would it be difficult to migrate them to
> Spark R set. I could try it with your support or if it's desired.
>
>
> I wouldn't mind doing testing of some defects etc as an initial learning
> curve if that would assist the community.
>
> Please, guide.
>
> Regards,
> Harmeet
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Regarding on the dataframe stat frequent

2016-06-17 Thread Sean Owen
If you have a clean test case demonstrating the desired behavior, and
a change which makes it work that way, yes make a JIRA and PR.

On Fri, Jun 17, 2016 at 1:35 AM, Luyi Wang  wrote:
> Hey there:
>
> The frequent-items function in the DataFrame stat package does not seem
> accurate. The documentation does mention that it can produce false
> positives, but the results still seem incorrect.
>
> Wondering if this is all known problem or not?
>
>
> Here is a quick example showing the problem.
>
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
>
> val rows = Seq((0,"a"),(1, "c"),(2, "a"),(3, "a"),(4, "b"),(5,
> "d")).toDF("id", "category")
> val history = rows.toDF("id", "category")
>
> history.stat.freqItems(Array("category"),0.5).show
> history.stat.freqItems(Array("category"),0.3).show
> history.stat.freqItems(Array("category"),0.51).show
> history.stat.freqItems(Array("category")).show
>
>
> Here is the output
>
> +------------------+
> |category_freqItems|
> +------------------+
> |            [d, a]|
> +------------------+
>
> +------------------+
> |category_freqItems|
> +------------------+
> |               [a]|
> +------------------+
>
> +------------------+
> |category_freqItems|
> +------------------+
> |                []|
> +------------------+
>
> +------------------+
> |category_freqItems|
> +------------------+
> |      [b, d, a, c]|
> +------------------+
>
>
>
> The problem results from the FreqItemCounter class's add function, which is
> used in the aggregation stage of singlePassFreqItems.
>
> According to the paper, the size of the returned frequent set can't be larger
> than 1/minimum_support, which we denote as k here, so in singlePassFreqItems
> the counterMap is created with this size.
>
> The logic of the add function is the following:
>
> When adding to the counter of an item, if the item already exists in the map,
> its counter is incremented. If it doesn't exist and the map size is less than
> k, it is inserted. If it doesn't exist and the map size is already k, the
> count to be inserted is compared with the current minimum value: if it is
> larger than or equal to the current minimum, the item is inserted, items with
> counters larger than the current minimum are kept, and items with counters
> smaller than or equal to it are removed. If the count to be inserted is
> smaller than the current minimum, the item is not inserted and the counters
> of all items in the map are reduced by the inserted count.
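>
> In simplified Scala, that add logic is roughly the following (a sketch only,
> with hypothetical names, not the actual FreqItemCounter source; k is
> 1/minimum_support):
>
> class ApproxCounter(k: Int) {
>   val baseMap = scala.collection.mutable.Map.empty[Any, Long]
>
>   def add(key: Any, count: Long): Unit = {
>     if (baseMap.contains(key)) {
>       baseMap(key) += count            // existing item: bump its counter
>     } else if (baseMap.size < k) {
>       baseMap(key) = count             // room left: insert
>     } else {
>       val minCount = baseMap.values.min
>       if (count >= minCount) {
>         // keep only items strictly above the current minimum, then insert
>         baseMap.retain { case (_, c) => c > minCount }
>         baseMap(key) = count
>       } else {
>         // rejected: deduct the rejected count from every retained counter
>         baseMap.transform { case (_, c) => c - count }
>       }
>     }
>   }
> }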
>
> Problem:
>
> Since only items with counts larger than the current minimum are retained, if
> the current minimum happens to be the count of the second most frequent item,
> that item is removed when the item to be inserted has the same count. In that
> case, a less frequent item can be inserted into the map afterward and
> returned later.
>
> Here is an example: "a" appears 3 times, "b" and "c" each appear 2 times, and
> "d" appears only once, for a total of 8 rows. For minimum support 0.5, the
> map is initialized with size 2. The correct answer should contain only items
> appearing more than 4 times, which is empty, yet it returns "a" and "d". Two
> items are returned because of the map size; "d" is returned because "b" and
> "c" appear the same number of times and more often than "d", but they are
> cleaned out once one of them has been inserted and the map reaches its size
> limit, so when "d" arrives the map is below the limit and "d" is inserted.
>
>
> val rows = Seq((0,"a"),(1, "b"),(2, "a"),(3, "a"),(4, "b"),(5,
> "c"),(6,"c"),(7,"d")).toDF("id", "category")
> val history = rows.toDF("id", "category")
>
> history.stat.freqItems(Array("category"),0.5).show
> history.stat.freqItems(Array("category"),0.3).show
> history.stat.freqItems(Array("category"),0.51).show
> history.stat.freqItems(Array("category")).show
>
>
> +------------------+
> |category_freqItems|
> +------------------+
> |            [d, a]|
> +------------------+
>
> +------------------+
> |category_freqItems|
> +------------------+
> |         [b, a, c]|
> +------------------+
>
> +------------------+
> |category_freqItems|
> +------------------+
> |                []|
> +------------------+
>
> +------------------+
> |category_freqItems|
> +------------------+
> |      [b, d, a, c]|
> +------------------+
>
>
> Hope this explains the problem.
>
> Thanks.
>
> -Luyi.
>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark internal Logging trait potential thread unsafe

2016-06-17 Thread Sean Owen
I think that's OK to change, yes. I don't see why it's necessary to
init log_ the way it is now. initializeLogIfNecessary() has a purpose
though.
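
For reference, here is a minimal sketch of the lazy-val pattern being
proposed (simplified, not the actual trait; Scala's lazy val initialization
is itself thread safe):

import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  protected def logName: String = this.getClass.getName.stripSuffix("$")

  // lazy val gives thread-safe, at-most-once initialization without
  // hand-rolled double-checked locking
  @transient private lazy val log_ : Logger = {
    // any one-time framework setup (e.g. initializeLogIfNecessary) would go here
    LoggerFactory.getLogger(logName)
  }

  protected def log: Logger = log_
}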

On Fri, Jun 17, 2016 at 2:39 AM, Prajwal Tuladhar  wrote:
> Hi,
>
> The way the log instance inside the Logging trait is currently being
> initialized doesn't seem to be thread safe [1]. The current implementation
> only guarantees that initializeLogIfNecessary() runs in a lazy + thread-safe
> way.
>
> Is there a reason why it can't be just: [2]
>
> @transient private lazy val log_ : Logger = {
>   initializeLogIfNecessary(false)
>   LoggerFactory.getLogger(logName)
> }
>
>
> And with that initializeLogIfNecessary() can be called without double
> locking.
>
> --
> --
> Cheers,
> Praj
>
> [1]
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala#L44-L50
> [2]
> https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/core/src/main/scala/org/apache/spark/internal/Logging.scala#L35

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



testing the kafka 0.10 connector

2016-06-17 Thread Reynold Xin
Cody has graciously worked on a new connector for dstream for Kafka 0.10.
Can people that use Kafka test this connector out? The patch is at
https://github.com/apache/spark/pull/11863

Although we have stopped merging new features into branch-2.0, this
connector is very decoupled from rest of Spark and we might be able to put
this into 2.0.1 (or 2.0.0 if everything works out).

Thanks.


Re: ImportError: No module named numpy

2016-06-17 Thread Bhupendra Mishra
The issue has been fixed. After a lot of digging I finally found the pretty
simple thing causing this problem.

It was related to a permissions issue on the Python libraries. The user I was
logged in as did not have enough permission to read/execute the following
Python library paths:

/usr/lib/python2.7/site-packages/
/usr/lib64/python2.7/

These paths should have read/execute permission for the user running the
python/pyspark program.

Thanks everyone for your help with same. Appreciate!
Regards


On Sun, Jun 5, 2016 at 12:04 AM, Daniel Rodriguez  wrote:

> Like people have said you need numpy in all the nodes of the cluster. The
> easiest way in my opinion is to use anaconda:
> https://www.continuum.io/downloads but that can get tricky to manage in
> multiple nodes if you don't have some configuration management skills.
>
> How are you deploying the spark cluster? If you are using cloudera I
> recommend to use the Anaconda Parcel:
> http://blog.cloudera.com/blog/2016/02/making-python-on-apache-hadoop-easier-with-anaconda-and-cdh/
>
> On 4 Jun 2016, at 11:13, Gourav Sengupta 
> wrote:
>
> Hi,
>
> I think that solution is too simple. Just download anaconda (if you pay
> for the licensed version you will eventually feel like being in heaven when
> you move to CI and CD and live in a world where you have a data product
> actually running in real life).
>
> Then start the pyspark program by including the following:
>
> PYSPARK_PYTHON=<<path to installation>>/anaconda2/bin/python2.7 PATH=$PATH:<<path to installation>>/anaconda/bin <<path to installation>>/pyspark
>
> :)
>
> In case you are using it in EMR the solution is a bit tricky. Just let me
> know in case you want any further help.
>
>
> Regards,
> Gourav Sengupta
>
>
>
>
>
> On Thu, Jun 2, 2016 at 7:59 PM, Eike von Seggern <
> eike.segg...@sevenval.com> wrote:
>
>> Hi,
>>
>> are you using Spark on one machine or many?
>>
>> If on many, are you sure numpy is correctly installed on all machines?
>>
>> To check that the environment is set-up correctly, you can try something
>> like
>>
>> import os
>> pythonpaths = sc.range(10).map(lambda i:
>> os.environ.get("PYTHONPATH")).collect()
>> print(pythonpaths)
>>
>> HTH
>>
>> Eike
>>
>> 2016-06-02 15:32 GMT+02:00 Bhupendra Mishra :
>>
>>> did not resolved. :(
>>>
>>> On Thu, Jun 2, 2016 at 3:01 PM, Sergio Fernández 
>>> wrote:
>>>

 On Thu, Jun 2, 2016 at 9:59 AM, Bhupendra Mishra <
 bhupendra.mis...@gmail.com> wrote:
>
> and i have already exported environment variable in spark-env.sh as
> follows.. error still there  error: ImportError: No module named numpy
>
> export PYSPARK_PYTHON=/usr/bin/python
>

 According the documentation at
 http://spark.apache.org/docs/latest/configuration.html#environment-variables
 the PYSPARK_PYTHON environment variable is for poniting to the Python
 interpreter binary.

 If you check the programming guide
 https://spark.apache.org/docs/0.9.0/python-programming-guide.html#installing-and-configuring-pyspark
 it says you need to add your custom path to PYTHONPATH (the script
 automatically adds the bin/pyspark there).

 So typically in Linux you would need to add the following (assuming you
 installed numpy there):

 export PYTHONPATH=$PYTHONPATH:/usr/lib/python2.7/dist-packages

 Hope that helps.




> On Thu, Jun 2, 2016 at 12:04 AM, Julio Antonio Soto de Vicente <
> ju...@esbet.es> wrote:
>
>> Try adding to spark-env.sh (renaming if you still have it with
>> .template at the end):
>>
>> PYSPARK_PYTHON=/path/to/your/bin/python
>>
>> Where your bin/python is your actual Python environment with Numpy
>> installed.
>>
>>
>> El 1 jun 2016, a las 20:16, Bhupendra Mishra <
>> bhupendra.mis...@gmail.com> escribió:
>>
>> I have numpy installed but where I should setup PYTHONPATH?
>>
>>
>> On Wed, Jun 1, 2016 at 11:39 PM, Sergio Fernández 
>> wrote:
>>
>>> sudo pip install numpy
>>>
>>> On Wed, Jun 1, 2016 at 5:56 PM, Bhupendra Mishra <
>>> bhupendra.mis...@gmail.com> wrote:
>>>
 Thanks .
 How can this be resolved?

 On Wed, Jun 1, 2016 at 9:02 PM, Holden Karau 
 wrote:

> Generally this means numpy isn't installed on the system or your
> PYTHONPATH has somehow gotten pointed somewhere odd,
>
> On Wed, Jun 1, 2016 at 8:31 AM, Bhupendra Mishra <
> bhupendra.mis...@gmail.com> wrote:
>
>> Can anyone please help me with the following error?
>>
>>  File
>> "/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/mllib/__init__.py",
>> line 25, in <module>
>>
>> ImportError: No module named numpy
>>
>>
>> Thanks in advance!
>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>>