I filed SPARK-12233

2015-12-08 Thread Fengdong Yu
Hi,

I filed an issue, please take a look:

https://issues.apache.org/jira/browse/SPARK-12233


It definitely can be reproduced.


Re: Failed to generate predicate Error when using dropna

2015-12-08 Thread Chang Ya-Hsuan
https://issues.apache.org/jira/browse/SPARK-12231

This is my first time creating a JIRA ticket.
Is the ticket filed properly?
Thanks

On Tue, Dec 8, 2015 at 9:59 PM, Reynold Xin  wrote:

> Can you create a JIRA ticket for this? Thanks.
>
>
> On Tue, Dec 8, 2015 at 5:25 PM, Chang Ya-Hsuan  wrote:
>
>> spark version: spark-1.5.2-bin-hadoop2.6
>> python version: 2.7.9
>> os: ubuntu 14.04
>>
>> code to reproduce error
>>
>> # write.py
>>
>> import pyspark
>> sc = pyspark.SparkContext()
>> sqlc = pyspark.SQLContext(sc)
>> df = sqlc.range(10)
>> df1 = df.withColumn('a', df['id'] * 2)
>> df1.write.partitionBy('id').parquet('./data')
>>
>>
>> # read.py
>>
>> import pyspark
>> sc = pyspark.SparkContext()
>> sqlc = pyspark.SQLContext(sc)
>> df2 = sqlc.read.parquet('./data')
>> df2.dropna().count()
>>
>>
>> $ spark-submit write.py
>> $ spark-submit read.py
>>
>> # error message
>>
>> 15/12/08 17:20:34 ERROR Filter: Failed to generate predicate, fallback to
>> interpreted org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
>> Binding attribute, tree: a#0L
>> ...
>>
>> If I write the data without partitionBy, the error doesn't happen.
>> Any suggestions?
>> Thanks!
>>
>> --
>> -- 張雅軒
>>
>
>


-- 
-- 張雅軒


Re: A proposal for Spark 2.0

2015-12-08 Thread Kostas Sakellis
I'd also like to make it a requirement that Spark 2.0 have a stable
dataframe and dataset API - we should not leave these APIs experimental in
the 2.0 release. We already know of at least one breaking change we need to
make to dataframes; now's the time to make any other changes we need to
stabilize these APIs. Is there anything we can do to make us more comfortable
with the dataset and dataframe APIs before the 2.0 release?

I've also been thinking that in Spark 2.0 we might want to consider strict
classpath isolation for user applications. Hadoop 3 is moving in this
direction. We could, for instance, run all user applications in their own
classloader that only inherits very specific classes from Spark (i.e., public
APIs). This would require user apps to explicitly declare their dependencies,
as there would no longer be any accidental class leaking. We do something
like this for the userClassPathFirst option, but it is not as strict as what I
described. This is a breaking change, but I think it will help eliminate
weird classpath incompatibility issues between user applications and Spark's
system dependencies.
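
To make this concrete, here is a purely illustrative sketch (none of these
class or parameter names exist in Spark, and the whitelist of "public API"
packages is only an assumption) of a child-first classloader that delegates
to Spark's loader only for whitelisted prefixes:

// Illustrative sketch only, not Spark code.
import java.net.{URL, URLClassLoader}

class ApiOnlyClassLoader(
    userJars: Array[URL],              // the user app's explicitly declared dependencies
    sparkLoader: ClassLoader,          // loader holding Spark and its system dependencies
    allowedPrefixes: Seq[String] =     // hypothetical whitelist of public API packages
      Seq("java.", "scala.", "org.apache.spark.api.", "org.apache.spark.sql.api."))
  extends URLClassLoader(userJars, null: ClassLoader) {  // parent = null: no accidental leaking

  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    if (allowedPrefixes.exists(p => name.startsWith(p))) {
      // Only whitelisted (JDK / Spark public API) classes come from Spark's loader.
      sparkLoader.loadClass(name)
    } else {
      // Everything else must be provided by the user's own jars; otherwise this
      // fails with ClassNotFoundException, which is stricter than userClassPathFirst.
      super.loadClass(name, resolve)
    }
  }
}

With the parent set to null, any class the user app did not declare (and that
is not on the whitelist) fails fast at load time instead of silently resolving
against Spark's copy of a conflicting dependency.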

Thoughts?

Kostas


On Fri, Dec 4, 2015 at 3:28 AM, Sean Owen  wrote:

> To be clear-er, I don't think it's clear yet whether a 1.7 release
> should exist or not. I could see both making sense. It's also not
> really necessary to decide now, well before a 1.6 is even out in the
> field. Deleting the version lost information, and I would not have
> done that given my reply. Reynold maybe I can take this up with you
> offline.
>
> On Thu, Dec 3, 2015 at 6:03 PM, Mark Hamstra 
> wrote:
> > Reynold's post from Nov. 25:
> >
> >> I don't think we should drop support for Scala 2.10, or make it harder in
> >> terms of operations for people to upgrade.
> >>
> >> If there are further objections, I'm going to bump remove the 1.7 version
> >> and retarget things to 2.0 on JIRA.
> >
> >
> > On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen  wrote:
> >>
> >> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I
> >> think that's premature. If there's a 1.7.0 then we've lost info about
> >> what it would contain. It's trivial at any later point to merge the
> >> versions. And, since things change and there's not a pressing need to
> >> decide one way or the other, it seems fine to at least collect this
> >> info like we have things like "1.4.3" that may never be released. I'd
> >> like to add it back?
> >>
> >> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen  wrote:
> >> > Maintaining both a 1.7 and 2.0 is too much work for the project, which
> >> > is over-stretched now. This means that after 1.6 it's just small
> >> > maintenance releases in 1.x and no substantial features or evolution.
> >> > This means that the "in progress" APIs in 1.x will stay that way
> >> > unless one updates to 2.x. It's not unreasonable, but it means the update
> >> > to the 2.x line isn't going to be that optional for users.
> >> >
> >> > Scala 2.10 is already EOL, right? Supporting it in 2.x means supporting
> >> > it for a couple of years, note. 2.10 is still used today, but that's the
> >> > point of the current stable 1.x release in general: if you want to
> >> > stick to current dependencies, stick to the current release. Although
> >> > I think that's the right way to think about support across major
> >> > versions in general, I can see that 2.x is more of a required update
> >> > for those following the project's fixes and releases. Hence it may indeed
> >> > be important to just keep supporting 2.10.
> >> >
> >> > I can't see supporting 2.12 at the same time (right?). Is that a
> >> > concern? It will be long since GA by the time 2.x is first released.
> >> >
> >> > There's another fairly coherent worldview where development continues
> >> > in 1.7 and focuses on finishing the loose ends and lots of bug fixing.
> >> > 2.0 is delayed somewhat into next year, and by that time supporting
> >> > 2.11+2.12 and Java 8 looks more feasible and more in tune with
> >> > currently deployed versions.
> >> >
> >> > I can't say I have a strong view but I personally hadn't imagined 2.x
> >> > would start now.
> >> >
> >> >
> >> > On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin 
> >> > wrote:
> >> >> I don't think we should drop support for Scala 2.10, or make it harder
> >> >> in terms of operations for people to upgrade.
> >> >>
> >> >> If there are further objections, I'm going to bump remove the 1.7 version
> >> >> and retarget things to 2.0 on JIRA.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


RE: Data and Model Parallelism in MLPC

2015-12-08 Thread Ulanov, Alexander
Hi Disha,

Which use case do you have in mind that would require model parallelism? It 
would need to have such a large number of weights that the model could not fit 
into the memory of a single machine. For example, multilayer perceptron 
topologies that are used for speech recognition have up to 100M weights. 
Present hardware is capable of accommodating this in main memory. That might 
be a problem for GPUs, but this is a different topic.

The straightforward way of model parallelism for fully connected neural 
networks is to distribute horizontal (or vertical) blocks of the weight matrices 
across several nodes. That means that the input data has to be reproduced on 
all these nodes. The forward and the backward passes will require re-assembling 
the outputs and the errors on each of the nodes after each layer, because each 
node can produce only partial results, since it holds only a part of the weights. 
According to my estimates, this is inefficient due to the large intermediate 
traffic between the nodes and should be used only if the model does not fit in 
the memory of a single machine. Another way of doing model parallelism would be 
to represent the network as a graph and use GraphX to write the forward and 
back propagation. However, this option does not seem very practical to me.

Best regards, Alexander

From: Disha Shrivastava [mailto:dishu@gmail.com]
Sent: Tuesday, December 08, 2015 11:19 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Data and Model Parallelism in MLPC

Hi Alexander,
Thanks for your response. Can you suggest ways to incorporate model parallelism 
in MLPC? I am trying to do the same in Spark. I got hold of your post 
http://apache-spark-developers-list.1001551.n3.nabble.com/Model-parallelism-with-RDD-td13141.html
where you divided the weight matrix across different worker machines. I 
have two basic questions in this regard:
1. How can I actually visualize/analyze and control how the nodes/weights of the 
neural network are divided across different workers?
2. Is there any alternate way to achieve model parallelism for MLPC in Spark? I 
believe we need some kind of synchronization and control for the updating of 
weights shared across different workers during backpropagation.
Looking forward to your views on this.
Thanks and Regards,
Disha

On Wed, Dec 9, 2015 at 12:36 AM, Ulanov, Alexander <alexander.ula...@hpe.com> wrote:
Hi Disha,

Multilayer perceptron classifier in Spark implements data parallelism.

Best regards, Alexander

From: Disha Shrivastava [mailto:dishu@gmail.com]
Sent: Tuesday, December 08, 2015 12:43 AM
To: dev@spark.apache.org; Ulanov, Alexander
Subject: Data and Model Parallelism in MLPC

Hi,
I would like to know if the implementation of MLPC in the latest released 
version of Spark ( 1.5.2 ) implements model parallelism and data parallelism as 
done in the DistBelief model implemented by Google  
http://static.googleusercontent.com/media/research.google.com/hi//archive/large_deep_networks_nips2012.pdf
Thanks And Regards,
Disha



Re: Data and Model Parallelism in MLPC

2015-12-08 Thread Disha Shrivastava
Hi Alexander,

Thanks for your response. Can you suggest ways to incorporate model
parallelism in MLPC? I am trying to do the same in Spark. I got hold of
your post
http://apache-spark-developers-list.1001551.n3.nabble.com/Model-parallelism-with-RDD-td13141.html
where you divided the weight matrix across different worker machines. I
have two basic questions in this regard:

1. How can I actually visualize/analyze and control how the nodes/weights of
the neural network are divided across different workers?

2. Is there any alternate way to achieve model parallelism for MLPC in
Spark? I believe we need some kind of synchronization and control
for the updating of weights shared across different workers during
backpropagation.

Looking forward to your views on this.

Thanks and Regards,
Disha

On Wed, Dec 9, 2015 at 12:36 AM, Ulanov, Alexander  wrote:

> Hi Disha,
>
>
>
> Multilayer perceptron classifier in Spark implements data parallelism.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Disha Shrivastava [mailto:dishu@gmail.com]
> *Sent:* Tuesday, December 08, 2015 12:43 AM
> *To:* dev@spark.apache.org; Ulanov, Alexander
> *Subject:* Data and Model Parallelism in MLPC
>
>
>
> Hi,
>
> I would like to know if the implementation of MLPC in the latest released
> version of Spark ( 1.5.2 ) implements model parallelism and data
> parallelism as done in the DistBelief model implemented by Google
> http://static.googleusercontent.com/media/research.google.com/hi//archive/large_deep_networks_nips2012.pdf
> 
>
>
> Thanks And Regards,
>
> Disha
>


RE: Data and Model Parallelism in MLPC

2015-12-08 Thread Ulanov, Alexander
Hi Disha,

Multilayer perceptron classifier in Spark implements data parallelism.

Best regards, Alexander

From: Disha Shrivastava [mailto:dishu@gmail.com]
Sent: Tuesday, December 08, 2015 12:43 AM
To: dev@spark.apache.org; Ulanov, Alexander
Subject: Data and Model Parallelism in MLPC

Hi,
I would like to know if the implementation of MLPC in the latest released 
version of Spark ( 1.5.2 ) implements model parallelism and data parallelism as 
done in the DistBelief model implemented by Google  
http://static.googleusercontent.com/media/research.google.com/hi//archive/large_deep_networks_nips2012.pdf
Thanks And Regards,
Disha


Re: Fastest way to build Spark from scratch

2015-12-08 Thread Nicholas Chammas
Interesting. As long as Spark's dependencies don't change that often, the
same caches could save "from scratch" build time over many months of Spark
development. Is that right?

On Tue, Dec 8, 2015 at 12:33 PM Josh Rosen  wrote:

> @Nick, on a fresh EC2 instance a significant chunk of the initial build
> time might be due to artifact resolution + downloading. Putting
> pre-populated Ivy and Maven caches onto your EC2 machine could shave a
> decent chunk of time off that first build.
>
> On Tue, Dec 8, 2015 at 9:16 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks for the tips, Jakob and Steve.
>>
>> It looks like my original approach is the best for me since I'm
>> installing Spark on newly launched EC2 instances and can't take advantage
>> of incremental compilation.
>>
>> Nick
>>
>> On Tue, Dec 8, 2015 at 7:01 AM Steve Loughran 
>> wrote:
>>
>>> On 7 Dec 2015, at 19:07, Jakob Odersky  wrote:
>>>
>>> make-distribution and the second code snippet both create a distribution
>>> from a clean state. They therefore require that every source file be
>>> compiled and that takes time (you can maybe tweak some settings or use a
>>> newer compiler to gain some speed).
>>>
>>> I'm inferring from your question that for your use-case deployment speed
>>> is a critical issue, furthermore you'd like to build Spark for lots of
>>> (every?) commit in a systematic way. In that case I would suggest you try
>>> using the second code snippet without the `clean` task and only resort to
>>> it if the build fails.
>>>
>>> On my local machine, an assembly without a clean drops from 6 minutes to
>>> 2.
>>>
>>> regards,
>>> --Jakob
>>>
>>>
>>> 1. you can use zinc -where possible- to speed up scala compilations
>>> 2. you might also consider setting up a local jenkins VM, hooked to
>>> whatever git repo & branch you are working off, and have it do the builds
>>> and tests for you. Not so great for interactive dev,
>>>
>>> finally, on the mac, the "say" command is pretty handy at letting you
>>> know when some work in a terminal is ready, so you can do the
>>> first-thing-in-the morning build-of-the-SNAPSHOTS
>>>
>>> mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say moo
>>>
>>> After that you can work on the modules you care about (via the -pl)
>>> option). That doesn't work if you are running on an EC2 instance though
>>>
>>>
>>>
>>>
>>> On 23 November 2015 at 20:18, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 Say I want to build a complete Spark distribution against Hadoop 2.6+
 as fast as possible from scratch.

 This is what I’m doing at the moment:

 ./make-distribution.sh -T 1C -Phadoop-2.6

 -T 1C instructs Maven to spin up 1 thread per available core. This
 takes around 20 minutes on an m3.large instance.

 I see that spark-ec2, on the other hand, builds Spark as follows
 when you deploy Spark at a specific git commit:

 sbt/sbt clean assembly
 sbt/sbt publish-local

 This seems slower than using make-distribution.sh, actually.

 Is there a faster way to do this?

 Nick
 ​

>>>
>>>
>>>
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-08 Thread Michael Armbrust
An update: the vote fails due to the -1. I'll post another RC as soon as
we've resolved these issues. In the meantime, I encourage people to
continue testing and to post any problems they encounter here.

On Sun, Dec 6, 2015 at 6:24 PM, Yin Huai  wrote:

> -1
>
> Two blocker bugs have been found after this RC.
> https://issues.apache.org/jira/browse/SPARK-12089 can cause data
> corruption when an external sorter spills data.
> https://issues.apache.org/jira/browse/SPARK-12155 can prevent tasks from
> acquiring memory even when the executor indeed can allocate memory by
> evicting storage memory.
>
> https://issues.apache.org/jira/browse/SPARK-12089 has been fixed. We are
> still working on https://issues.apache.org/jira/browse/SPARK-12155.
>
> On Fri, Dec 4, 2015 at 3:04 PM, Mark Hamstra 
> wrote:
>
>> 0
>>
>> Currently figuring out who is responsible for the regression that I am
>> seeing in some user code ScalaUDFs that make use of Timestamps and where
>> NULL from a CSV file read in via a TestHive#registerTestTable is now
>> producing 1969-12-31 23:59:59.99 instead of null.
>>
>> On Thu, Dec 3, 2015 at 1:57 PM, Sean Owen  wrote:
>>
>>> Licenses and signature are all fine.
>>>
>>> Docker integration tests consistently fail for me with Java 7 / Ubuntu
>>> and "-Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver"
>>>
>>> *** RUN ABORTED ***
>>>   java.lang.NoSuchMethodError:
>>>
>>> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>>>   at
>>> org.glassfish.jersey.apache.connector.ApacheConnector.(ApacheConnector.java:240)
>>>   at
>>> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>>>   at
>>> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>>>   at
>>> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>>>   at
>>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>>>   at
>>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>>>   at
>>> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>>>   at
>>> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>>>   at
>>> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>>>   at
>>> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>>>
>>> I also get this failure consistently:
>>>
>>> DirectKafkaStreamSuite
>>> - offset recovery *** FAILED ***
>>>   recoveredOffsetRanges.forall(((or: (org.apache.spark.streaming.Time,
>>> Array[org.apache.spark.streaming.kafka.OffsetRange])) =>
>>>
>>> earlierOffsetRangesAsSets.contains(scala.Tuple2.apply[org.apache.spark.streaming.Time,
>>>
>>> scala.collection.immutable.Set[org.apache.spark.streaming.kafka.OffsetRange]](or._1,
>>>
>>> scala.this.Predef.refArrayOps[org.apache.spark.streaming.kafka.OffsetRange](or._2).toSet[org.apache.spark.streaming.kafka.OffsetRange]
>>> was false Recovered ranges are not the same as the ones generated
>>> (DirectKafkaStreamSuite.scala:301)
>>>
>>> On Wed, Dec 2, 2015 at 8:26 PM, Michael Armbrust 
>>> wrote:
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version
>>> > 1.6.0!
>>> >
>>> > The vote is open until Saturday, December 5, 2015 at 21:00 UTC and
>>> passes if
>>> > a majority of at least 3 +1 PMC votes are cast.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 1.6.0
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v1.6.0-rc1
>>> > (bf525845cef159d2d4c9f4d64e158f037179b5c4)
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> >
>>> http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/
>>> >
>>> > Release artifacts are signed with the following key:
>>> > https://people.apache.org/keys/committer/pwendell.asc
>>> >
>>> > The staging repository for this release can be found at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1165/
>>> >
>>> > The test repository (versioned as v1.6.0-rc1) for this release can be
>>> found
>>> > at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1164/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> >
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc1-docs/
>>> >
>>> >
>>> > ===
>>> > == How can I help test this release? ==
>>> > ===
>>> > If you are a Spark user, you can help us test this release by taking an
>>> > existing Spark workload and running on this release candidate, then
>>> > reporting any regressions.
>>> >
>>> > 
>>> > == 

Re: Fastest way to build Spark from scratch

2015-12-08 Thread Stephen Boesch
I will echo Steve L's comment about having zinc running (with --nailed).
That provides at least a 2x speedup; sometimes without it Spark simply
does not build for me.

2015-12-08 9:33 GMT-08:00 Josh Rosen :

> @Nick, on a fresh EC2 instance a significant chunk of the initial build
> time might be due to artifact resolution + downloading. Putting
> pre-populated Ivy and Maven caches onto your EC2 machine could shave a
> decent chunk of time off that first build.
>
> On Tue, Dec 8, 2015 at 9:16 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks for the tips, Jakob and Steve.
>>
>> It looks like my original approach is the best for me since I'm
>> installing Spark on newly launched EC2 instances and can't take advantage
>> of incremental compilation.
>>
>> Nick
>>
>> On Tue, Dec 8, 2015 at 7:01 AM Steve Loughran 
>> wrote:
>>
>>> On 7 Dec 2015, at 19:07, Jakob Odersky  wrote:
>>>
>>> make-distribution and the second code snippet both create a distribution
>>> from a clean state. They therefore require that every source file be
>>> compiled and that takes time (you can maybe tweak some settings or use a
>>> newer compiler to gain some speed).
>>>
>>> I'm inferring from your question that for your use-case deployment speed
>>> is a critical issue, furthermore you'd like to build Spark for lots of
>>> (every?) commit in a systematic way. In that case I would suggest you try
>>> using the second code snippet without the `clean` task and only resort to
>>> it if the build fails.
>>>
>>> On my local machine, an assembly without a clean drops from 6 minutes to
>>> 2.
>>>
>>> regards,
>>> --Jakob
>>>
>>>
>>> 1. you can use zinc -where possible- to speed up scala compilations
>>> 2. you might also consider setting up a local jenkins VM, hooked to
>>> whatever git repo & branch you are working off, and have it do the builds
>>> and tests for you. Not so great for interactive dev,
>>>
>>> finally, on the mac, the "say" command is pretty handy at letting you
>>> know when some work in a terminal is ready, so you can do the
>>> first-thing-in-the morning build-of-the-SNAPSHOTS
>>>
>>> mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say moo
>>>
>>> After that you can work on the modules you care about (via the -pl)
>>> option). That doesn't work if you are running on an EC2 instance though
>>>
>>>
>>>
>>>
>>> On 23 November 2015 at 20:18, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 Say I want to build a complete Spark distribution against Hadoop 2.6+
 as fast as possible from scratch.

 This is what I’m doing at the moment:

 ./make-distribution.sh -T 1C -Phadoop-2.6

 -T 1C instructs Maven to spin up 1 thread per available core. This
 takes around 20 minutes on an m3.large instance.

 I see that spark-ec2, on the other hand, builds Spark as follows
 when you deploy Spark at a specific git commit:

 sbt/sbt clean assembly
 sbt/sbt publish-local

 This seems slower than using make-distribution.sh, actually.

 Is there a faster way to do this?

 Nick
 ​

>>>
>>>
>>>
>


Re: Fastest way to build Spark from scratch

2015-12-08 Thread Josh Rosen
@Nick, on a fresh EC2 instance a significant chunk of the initial build
time might be due to artifact resolution + downloading. Putting
pre-populated Ivy and Maven caches onto your EC2 machine could shave a
decent chunk of time off that first build.

On Tue, Dec 8, 2015 at 9:16 AM, Nicholas Chammas  wrote:

> Thanks for the tips, Jakob and Steve.
>
> It looks like my original approach is the best for me since I'm installing
> Spark on newly launched EC2 instances and can't take advantage of
> incremental compilation.
>
> Nick
>
> On Tue, Dec 8, 2015 at 7:01 AM Steve Loughran 
> wrote:
>
>> On 7 Dec 2015, at 19:07, Jakob Odersky  wrote:
>>
>> make-distribution and the second code snippet both create a distribution
>> from a clean state. They therefore require that every source file be
>> compiled and that takes time (you can maybe tweak some settings or use a
>> newer compiler to gain some speed).
>>
>> I'm inferring from your question that for your use-case deployment speed
>> is a critical issue, furthermore you'd like to build Spark for lots of
>> (every?) commit in a systematic way. In that case I would suggest you try
>> using the second code snippet without the `clean` task and only resort to
>> it if the build fails.
>>
>> On my local machine, an assembly without a clean drops from 6 minutes to
>> 2.
>>
>> regards,
>> --Jakob
>>
>>
>> 1. you can use zinc -where possible- to speed up scala compilations
>> 2. you might also consider setting up a local jenkins VM, hooked to
>> whatever git repo & branch you are working off, and have it do the builds
>> and tests for you. Not so great for interactive dev,
>>
>> finally, on the mac, the "say" command is pretty handy at letting you
>> know when some work in a terminal is ready, so you can do the
>> first-thing-in-the morning build-of-the-SNAPSHOTS
>>
>> mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say moo
>>
>> After that you can work on the modules you care about (via the -pl)
>> option). That doesn't work if you are running on an EC2 instance though
>>
>>
>>
>>
>> On 23 November 2015 at 20:18, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Say I want to build a complete Spark distribution against Hadoop 2.6+ as
>>> fast as possible from scratch.
>>>
>>> This is what I’m doing at the moment:
>>>
>>> ./make-distribution.sh -T 1C -Phadoop-2.6
>>>
>>> -T 1C instructs Maven to spin up 1 thread per available core. This
>>> takes around 20 minutes on an m3.large instance.
>>>
>>> I see that spark-ec2, on the other hand, builds Spark as follows
>>> when you deploy Spark at a specific git commit:
>>>
>>> sbt/sbt clean assembly
>>> sbt/sbt publish-local
>>>
>>> This seems slower than using make-distribution.sh, actually.
>>>
>>> Is there a faster way to do this?
>>>
>>> Nick
>>> ​
>>>
>>
>>
>>


Re: Fastest way to build Spark from scratch

2015-12-08 Thread Nicholas Chammas
Thanks for the tips, Jakob and Steve.

It looks like my original approach is the best for me since I'm installing
Spark on newly launched EC2 instances and can't take advantage of
incremental compilation.

Nick

On Tue, Dec 8, 2015 at 7:01 AM Steve Loughran 
wrote:

> On 7 Dec 2015, at 19:07, Jakob Odersky  wrote:
>
> make-distribution and the second code snippet both create a distribution
> from a clean state. They therefore require that every source file be
> compiled and that takes time (you can maybe tweak some settings or use a
> newer compiler to gain some speed).
>
> I'm inferring from your question that for your use-case deployment speed
> is a critical issue, furthermore you'd like to build Spark for lots of
> (every?) commit in a systematic way. In that case I would suggest you try
> using the second code snippet without the `clean` task and only resort to
> it if the build fails.
>
> On my local machine, an assembly without a clean drops from 6 minutes to 2.
>
> regards,
> --Jakob
>
>
> 1. you can use zinc -where possible- to speed up scala compilations
> 2. you might also consider setting up a local jenkins VM, hooked to
> whatever git repo & branch you are working off, and have it do the builds
> and tests for you. Not so great for interactive dev,
>
> finally, on the mac, the "say" command is pretty handy at letting you know
> when some work in a terminal is ready, so you can do the first-thing-in-the
> morning build-of-the-SNAPSHOTS
>
> mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say moo
>
> After that you can work on the modules you care about (via the -pl)
> option). That doesn't work if you are running on an EC2 instance though
>
>
>
>
> On 23 November 2015 at 20:18, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> Say I want to build a complete Spark distribution against Hadoop 2.6+ as
>> fast as possible from scratch.
>>
>> This is what I’m doing at the moment:
>>
>> ./make-distribution.sh -T 1C -Phadoop-2.6
>>
>> -T 1C instructs Maven to spin up 1 thread per available core. This takes
>> around 20 minutes on an m3.large instance.
>>
>> I see that spark-ec2, on the other hand, builds Spark as follows
>> when you deploy Spark at a specific git commit:
>>
>> sbt/sbt clean assembly
>> sbt/sbt publish-local
>>
>> This seems slower than using make-distribution.sh, actually.
>>
>> Is there a faster way to do this?
>>
>> Nick
>> ​
>>
>
>
>


Filter the nulls before an inner join to solve the problem of data skew

2015-12-08 Thread vector
When I join two tables, I find that one table has a data skew problem, and the 
skewed value of the field is null. So I want to filter out the nulls before the 
inner join, like this:


a.key is skewed and the skewed value is null


Change


"select * from a join b on a.key = b.key"


to


"select * from a join b on a.key = b.key and a.key is not null"


Is this idea feasible?
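
For what it's worth, here is a sketch of the same idea in the DataFrame API
(the table and column names are taken from the example above). Since NULL =
NULL is never true in SQL, a null key can never satisfy the inner-join
condition anyway, so the pre-filter should only drop rows the join would
discard, while avoiding shuffling them all into one heavily skewed partition:

// Illustrative sketch; assumes the two tables are registered as "a" and "b".
import org.apache.spark.sql.SQLContext

def skewAwareInnerJoin(sqlContext: SQLContext): Unit = {
  val a = sqlContext.table("a")
  val b = sqlContext.table("b")

  // Drop null join keys up front; they could never match in the inner join,
  // but they would all hash to the same skewed partition during the shuffle.
  val joined = a.filter(a("key").isNotNull).join(b, a("key") === b("key"))

  joined.show()
}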

Re: Failed to generate predicate Error when using dropna

2015-12-08 Thread Reynold Xin
Can you create a JIRA ticket for this? Thanks.


On Tue, Dec 8, 2015 at 5:25 PM, Chang Ya-Hsuan  wrote:

> spark version: spark-1.5.2-bin-hadoop2.6
> python version: 2.7.9
> os: ubuntu 14.04
>
> code to reproduce error
>
> # write.py
>
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df = sqlc.range(10)
> df1 = df.withColumn('a', df['id'] * 2)
> df1.write.partitionBy('id').parquet('./data')
>
>
> # read.py
>
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df2 = sqlc.read.parquet('./data')
> df2.dropna().count()
>
>
> $ spark-submit write.py
> $ spark-submit read.py
>
> # error message
>
> 15/12/08 17:20:34 ERROR Filter: Failed to generate predicate, fallback to
> interpreted org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
> Binding attribute, tree: a#0L
> ...
>
> If I write the data without partitionBy, the error doesn't happen.
> Any suggestions?
> Thanks!
>
> --
> -- 張雅軒
>


Re: Fastest way to build Spark from scratch

2015-12-08 Thread Steve Loughran

On 7 Dec 2015, at 19:07, Jakob Odersky <joder...@gmail.com> wrote:

make-distribution and the second code snippet both create a distribution from a 
clean state. They therefore require that every source file be compiled and that 
takes time (you can maybe tweak some settings or use a newer compiler to gain 
some speed).

I'm inferring from your question that for your use-case deployment speed is a 
critical issue, furthermore you'd like to build Spark for lots of (every?) 
commit in a systematic way. In that case I would suggest you try using the 
second code snippet without the `clean` task and only resort to it if the build 
fails.

On my local machine, an assembly without a clean drops from 6 minutes to 2.

regards,
--Jakob

1. you can use zinc -where possible- to speed up scala compilations
2. you might also consider setting up a local jenkins VM, hooked to whatever 
git repo & branch you are working off, and have it do the builds and tests for 
you. Not so great for interactive dev,

finally, on the mac, the "say" command is pretty handy at letting you know when 
some work in a terminal is ready, so you can do the first-thing-in-the morning 
build-of-the-SNAPSHOTS

mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say moo

After that you can work on the modules you care about (via the -pl) option). 
That doesn't work if you are running on an EC2 instance though




On 23 November 2015 at 20:18, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

Say I want to build a complete Spark distribution against Hadoop 2.6+ as fast 
as possible from scratch.

This is what I’m doing at the moment:

./make-distribution.sh -T 1C -Phadoop-2.6


-T 1C instructs Maven to spin up 1 thread per available core. This takes around 
20 minutes on an m3.large instance.

I see that spark-ec2, on the other hand, builds Spark as follows when you
deploy Spark at a specific git commit:

sbt/sbt clean assembly
sbt/sbt publish-local


This seems slower than using make-distribution.sh, actually.

Is there a faster way to do this?

Nick

​




Failed to generate predicate Error when using dropna

2015-12-08 Thread Chang Ya-Hsuan
spark version: spark-1.5.2-bin-hadoop2.6
python version: 2.7.9
os: ubuntu 14.04

code to reproduce error

# write.py

import pyspark
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.range(10)
df1 = df.withColumn('a', df['id'] * 2)
df1.write.partitionBy('id').parquet('./data')


# read.py

import pyspark
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df2 = sqlc.read.parquet('./data')
df2.dropna().count()


$ spark-submit write.py
$ spark-submit read.py

# error message

15/12/08 17:20:34 ERROR Filter: Failed to generate predicate, fallback to
interpreted org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
Binding attribute, tree: a#0L
...

If I write the data without partitionBy, the error doesn't happen.
Any suggestions?
Thanks!

-- 
-- 張雅軒


Re: mlib compilation errors

2015-12-08 Thread wei....@kaiyuandao.com
It is probably because I ran "./dev/change-scala-version.sh 2.11" after 
importing these projects in IntelliJ. I reimported the projects later, and it 
works fine now.

Closing this thread. Thanks.



 
From: wei@kaiyuandao.com
Sent: 2015-12-07 16:43
To: dev
Subject: mlib compilation errors
Hi, when I was compiling the mlib project in IntelliJ, I got the following 
errors. If I run mvn from the command line, it works fine. Has anyone run into 
the same issue? Thanks





Data and Model Parallelism in MLPC

2015-12-08 Thread Disha Shrivastava
Hi,

I would like to know if the implementation of MLPC in the latest released
version of Spark ( 1.5.2 ) implements model parallelism and data
parallelism as done in the DistBelief model implemented by Google
http://static.googleusercontent.com/media/research.google.com/hi//archive/large_deep_networks_nips2012.pdf


Thanks And Regards,
Disha