Fwd: multiple count distinct in SQL/DataFrame?

2015-10-07 Thread Reynold Xin
Adding user list too.



-- Forwarded message --
From: Reynold Xin 
Date: Tue, Oct 6, 2015 at 5:54 PM
Subject: Re: multiple count distinct in SQL/DataFrame?
To: "dev@spark.apache.org" 


To provide more context, if we do remove this feature, the following SQL
query would throw an AnalysisException:

select count(distinct colA), count(distinct colB) from foo;

The following should still work:

select count(distinct colA) from foo;

The following should also work:

select count(distinct colA, colB) from foo;


On Tue, Oct 6, 2015 at 5:51 PM, Reynold Xin  wrote:

> The current implementation of multiple count distinct in a single query is
> very inferior in terms of performance and robustness, and it is also hard
> to guarantee correctness of the implementation in some of the refactorings
> for Tungsten. Supporting a better version of it is possible in the future,
> but will take a lot of engineering efforts. Most other Hadoop-based SQL
> systems (e.g. Hive, Impala) don't support this feature.
>
> As a result, we are considering removing support for multiple count
> distinct in a single query in the next Spark release (1.6). If you use this
> feature, please reply to this email. Thanks.
>
> Note that if you don't care about null values, it is relatively easy to
> reconstruct a query using joins to support multiple distincts.
>
>
>
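
(For readers hitting the AnalysisException above, here is a minimal, hedged
sketch of one possible join-based rewrite in the spirit of the note at the end
of the message, assuming a spark-shell sqlContext and the foo table from the
examples; count(DISTINCT ...) already ignores nulls, so treat this purely as an
illustration of the general idea:)

  // Compute each distinct count in its own aggregation, then combine the
  // single-row results with a cartesian join.
  val distinctA = sqlContext.sql("SELECT count(DISTINCT colA) AS cnt_a FROM foo")
  val distinctB = sqlContext.sql("SELECT count(DISTINCT colB) AS cnt_b FROM foo")
  val combined  = distinctA.join(distinctB)  // both inputs are single-row DataFrames
  combined.show()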


Re: multiple count distinct in SQL/DataFrame?

2015-10-07 Thread Herman van Hövell tot Westerflier
We could also fall back to approximate count distincts when the user
requests multiple count distincts. This is less invasive than throwing an
AnalysisException, but it could violate the principle of least surprise.
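
(For reference, a hedged sketch of what an explicit approximate fallback could
look like with the existing DataFrame API, assuming a DataFrame named df with
the colA/colB columns from the earlier examples; approxCountDistinct is
HyperLogLog-based, so the results are estimates:)

  import org.apache.spark.sql.functions.approxCountDistinct

  // Approximate distinct counts with a 5% target relative error.
  val approx = df.agg(
    approxCountDistinct("colA", 0.05).as("approx_distinct_colA"),
    approxCountDistinct("colB", 0.05).as("approx_distinct_colB"))
  approx.show()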



Met vriendelijke groet/Kind regards,

Herman van Hövell tot Westerflier

QuestTec B.V.
Torenwacht 98
2353 DC Leiderdorp
hvanhov...@questtec.nl
+599 9 521 4402


2015-10-07 22:43 GMT+02:00 Reynold Xin :

> Adding user list too.
>
>
>
> -- Forwarded message --
> From: Reynold Xin 
> Date: Tue, Oct 6, 2015 at 5:54 PM
> Subject: Re: multiple count distinct in SQL/DataFrame?
> To: "dev@spark.apache.org" 
>
>
> To provide more context, if we do remove this feature, the following SQL
> query would throw an AnalysisException:
>
> select count(distinct colA), count(distinct colB) from foo;
>
> The following should still work:
>
> select count(distinct colA) from foo;
>
> The following should also work:
>
> select count(distinct colA, colB) from foo;
>
>
> On Tue, Oct 6, 2015 at 5:51 PM, Reynold Xin  wrote:
>
>> The current implementation of multiple count distinct in a single query
>> is very inferior in terms of performance and robustness, and it is also
>> hard to guarantee correctness of the implementation in some of the
>> refactorings for Tungsten. Supporting a better version of it is possible in
>> the future, but will take a lot of engineering efforts. Most other
>> Hadoop-based SQL systems (e.g. Hive, Impala) don't support this feature.
>>
>> As a result, we are considering removing support for multiple count
>> distinct in a single query in the next Spark release (1.6). If you use this
>> feature, please reply to this email. Thanks.
>>
>> Note that if you don't care about null values, it is relatively easy to
>> reconstruct a query using joins to support multiple distincts.
>>
>>
>>
>
>


Re: multiple count distinct in SQL/DataFrame?

2015-10-07 Thread Mayank Pradhan
Is this limited only to global (grand-total) multiple count distincts, or does
it extend to all kinds of multiple count distincts? More precisely, would the
following multiple count distinct query also be affected?
select a, b, count(distinct x), count(distinct y) from foo group by a,b;

It would be unfortunate to lose that too.

-Mayank

On Wed, Oct 7, 2015 at 1:56 PM, Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> We could also fall back to approximate count distincts when the user
> requests multiple count distincts. This is less invasive than throwing an
> AnalysisException, but it could violate the principle of least surprise.
>
>
>
> Met vriendelijke groet/Kind regards,
>
> Herman van Hövell tot Westerflier
>
> QuestTec B.V.
> Torenwacht 98
> 2353 DC Leiderdorp
> hvanhov...@questtec.nl
> +599 9 521 4402
>
>
> 2015-10-07 22:43 GMT+02:00 Reynold Xin :
>
>> Adding user list too.
>>
>>
>>
>> -- Forwarded message --
>> From: Reynold Xin 
>> Date: Tue, Oct 6, 2015 at 5:54 PM
>> Subject: Re: multiple count distinct in SQL/DataFrame?
>> To: "dev@spark.apache.org" 
>>
>>
>> To provide more context, if we do remove this feature, the following SQL
>> query would throw an AnalysisException:
>>
>> select count(distinct colA), count(distinct colB) from foo;
>>
>> The following should still work:
>>
>> select count(distinct colA) from foo;
>>
>> The following should also work:
>>
>> select count(distinct colA, colB) from foo;
>>
>>
>> On Tue, Oct 6, 2015 at 5:51 PM, Reynold Xin  wrote:
>>
>>> The current implementation of multiple count distinct in a single query
>>> is very inferior in terms of performance and robustness, and it is also
>>> hard to guarantee correctness of the implementation in some of the
>>> refactorings for Tungsten. Supporting a better version of it is possible in
>>> the future, but will take a lot of engineering efforts. Most other
>>> Hadoop-based SQL systems (e.g. Hive, Impala) don't support this feature.
>>>
>>> As a result, we are considering removing support for multiple count
>>> distinct in a single query in the next Spark release (1.6). If you use this
>>> feature, please reply to this email. Thanks.
>>>
>>> Note that if you don't care about null values, it is relatively easy to
>>> reconstruct a query using joins to support multiple distincts.
>>>
>>>
>>>
>>
>>
>


What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-07 Thread YiZhi Liu
Hi everyone,

I'm curious about the difference between
ml.classification.LogisticRegression and
mllib.classification.LogisticRegressionWithLBFGS. Both of them are
optimized using LBFGS; the only difference I see is that LogisticRegression
takes a DataFrame while LogisticRegressionWithLBFGS takes an RDD.

So I wonder,
1. Why not simply add a DataFrame training interface to
LogisticRegressionWithLBFGS?
2. What's the difference between the ml.classification and
mllib.classification packages?
3. Why doesn't ml.classification.LogisticRegression call
mllib.optimization.LBFGS / mllib.optimization.OWLQN directly? Instead,
it uses breeze.optimize.LBFGS and re-implements most of the procedures
in mllib.optimization.{LBFGS,OWLQN}.

Thank you.

Best,

-- 
Yizhi Liu
Senior Software Engineer / Data Mining
www.mvad.com, Shanghai, China

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: GraphX PageRank keeps 3 copies of graph in memory

2015-10-07 Thread Ulanov, Alexander
Hi Ankur,

Could you help with explanation of the problem below?

Best regards, Alexander

From: Ulanov, Alexander
Sent: Friday, October 02, 2015 11:39 AM
To: 'Robin East'
Cc: dev@spark.apache.org
Subject: RE: GraphX PageRank keeps 3 copies of graph in memory

Hi Robin,

Sounds interesting. I am running 1.5.0. Could you copy-paste your Storage tab?

I’ve just double-checked on another cluster with 1 master and 5 workers. It
still has 3 pairs of VertexRDD and EdgeRDD at the end of the benchmark’s execution:

RDD Name              Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in ExternalBlockStore  Size on Disk
VertexRDD             Memory Deserialized 1x Replicated  3                  150%             6.9 MB          0.0 B                       0.0 B
EdgeRDD               Memory Deserialized 1x Replicated  2                  100%             155.5 MB        0.0 B                       0.0 B
EdgeRDD               Memory Deserialized 1x Replicated  2                  100%             154.7 MB        0.0 B                       0.0 B
VertexRDD, VertexRDD  Memory Deserialized 1x Replicated  3                  150%             8.4 MB          0.0 B                       0.0 B
EdgeRDD               Memory Deserialized 1x Replicated  2                  100%             202.9 MB        0.0 B                       0.0 B
VertexRDD             Memory Deserialized 1x Replicated  2                  100%             5.6 MB          0.0 B                       0.0 B

During the execution I observe that one pair is added and removed from the 
list. This should correspond to the unpersist statements in the code.

Also, according to the code, one should end up with 1 set of RDDs, because of
the unpersist statements at the end of the loop. Does that make sense to you?

Best regards, Alexander

From: Robin East [mailto:robin.e...@xense.co.uk]
Sent: Friday, October 02, 2015 12:27 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: GraphX PageRank keeps 3 copies of graph in memory

Alexander,

I’ve just run the benchmark and only end up with 2 sets of RDDs in the Storage 
tab. This is on 1.5.0, what version are you using?

Robin
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action




On 30 Sep 2015, at 23:55, Ulanov, Alexander wrote:

Dear Spark developers,

I would like to understand GraphX caching behavior with regard to PageRank in
Spark, in particular the following implementation of PageRank:
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

On each iteration the new graph is created and cached, and the old graph is 
un-cached:
1) Create new graph and cache it:
rankGraph = rankGraph.joinVertices(rankUpdates) {
  (id, oldRank, msgSum) => rPrb(src, id) + (1.0 - resetProb) * msgSum
}.cache()
2) Unpersist the old one:
  prevRankGraph.vertices.unpersist(false)
  prevRankGraph.edges.unpersist(false)
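
(For context, a simplified, self-contained sketch of this cache-then-unpersist
iteration pattern; the names and the message computation are only an
approximation of what PageRank.scala does, not a copy of it:)

  import org.apache.spark.graphx.Graph

  def iterate(graph: Graph[Double, Double], numIter: Int,
              resetProb: Double = 0.15): Graph[Double, Double] = {
    var rankGraph = graph.cache()
    var prevRankGraph: Graph[Double, Double] = null
    var iteration = 0
    while (iteration < numIter) {
      // Messages for this iteration (the real code also uses aggregateMessages).
      val rankUpdates = rankGraph.aggregateMessages[Double](
        ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr), _ + _)

      prevRankGraph = rankGraph
      rankGraph = rankGraph.joinVertices(rankUpdates) {
        (id, oldRank, msgSum) => resetProb + (1.0 - resetProb) * msgSum
      }.cache()

      // Materialize the new graph before dropping the old one.
      rankGraph.edges.foreachPartition(_ => {})
      prevRankGraph.vertices.unpersist(blocking = false)
      prevRankGraph.edges.unpersist(blocking = false)

      iteration += 1
    }
    rankGraph
  }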

According to the code, at the end of each iteration only one graph should be in
memory, i.e. one EdgeRDD and one VertexRDD. During the iteration, exactly
between the mentioned lines of code, there will be two graphs, old and new, that
is, two pairs of Edge and Vertex RDDs. However, when I run the example provided
in the Spark examples folder, I observe different behavior.

Run the example (I checked that it runs the mentioned code):
$SPARK_HOME/bin/spark-submit --class 
"org.apache.spark.examples.graphx.SynthBenchmark"  --master 
spark://mynode.net:7077 $SPARK_HOME/examples/target/spark-examples.jar

According to the “Storage” tab and the RDD DAG in the Spark UI, 3 VertexRDDs and
3 EdgeRDDs are cached, even after all iterations are finished, although the
mentioned code suggests caching at most 2 (and only in a particular stage of the
iteration):
https://drive.google.com/file/d/0BzYMzvDiCep5WFpnQjFzNy0zYlU/view?usp=sharing
Edges (the green ones are cached):
https://drive.google.com/file/d/0BzYMzvDiCep5S2JtYnhVTlV1Sms/view?usp=sharing
Vertices (the green ones are cached):
https://drive.google.com/file/d/0BzYMzvDiCep5S1k4N2NFb05RZDA/view?usp=sharing

Could you explain, why 3 VertexRDDs and 3 EdgeRDDs are cached?

Is it OK that there is double caching in the code, given that joinVertices
implicitly caches vertices and then the graph is cached again in the PageRank code?

Best regards, Alexander



Re: SparkSQL: First query execution is always slower than subsequent queries

2015-10-07 Thread Michael Armbrust
-dev +user

1). Is that the reason why it's always slow in the first run? Or are there
> any other reasons? Apparently it loads data into memory every time, so it
> shouldn't be something to do with disk reads, should it?
>

You are probably seeing the effect of the JVM's JIT.  The first run is
executing in interpreted mode.  Once the JVM sees it is a hot piece of code,
it will compile it to native code.  This applies both to Spark / Spark SQL
itself and (as of Spark 1.5) to the code that we dynamically generate for
doing expression evaluation.  Multiple runs with the same expressions will
use cached code that might have been JITed.
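
(A hedged illustration of how one might account for that warm-up when timing
queries; sqlContext is the context provided by spark-shell/pyspark, and
timeQuery is a hypothetical helper, not part of any Spark API:)

  // Run the query once to warm up the JIT and the generated-code cache,
  // then time only the subsequent runs.
  def timeQuery(query: String, runs: Int = 5): Seq[Long] = {
    sqlContext.sql(query).collect()            // warm-up run, result discarded
    (1 to runs).map { _ =>
      val start = System.nanoTime()
      sqlContext.sql(query).collect()
      (System.nanoTime() - start) / 1000000    // elapsed milliseconds
    }
  }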


> 2). Does Spark use Hadoop's MapReduce engine under the hood? If so, can
> we configure it to use MR2 instead of MR1?
>

No, we do not use the MapReduce engine for execution.  You can, however,
compile Spark to work with either version of Hadoop so that you can access
HDFS, etc.


Understanding code/closure shipment to Spark workers‏

2015-10-07 Thread Arijit
 Hi,
 
I want to understand the code flow starting from the Spark jar that I submit
through spark-submit: how does Spark identify and extract the closures, clean
and serialize them, and ship them to workers to execute as tasks? Can someone
point me to any documentation, or to the relevant source code paths, to help me
understand this?
 
Thanks, Arijit

SparkSQL: First query execution is always slower than subsequent queries

2015-10-07 Thread Lloyd Haris
Hi Spark Devs,

I am doing a performance evaluation of Spark using pyspark. I am using
Spark 1.5 with a Hadoop 2.6 cluster of 4 nodes and ran these tests in local
mode.

After a few dozen test executions, it turned out that the very first
SparkSQL query execution is always slower than the subsequent executions of
the same query.

First run:

15/10/08 11:15:35 INFO ParseDriver: Parsing command: SELECT CATAID, RA, DEC
FROM InputCatA
15/10/08 11:15:36 INFO ParseDriver: Parse Completed
15/10/08 11:15:36 INFO MemoryStore: ensureFreeSpace(484576) called with
curMem=0, maxMem=556038881
15/10/08 11:15:36 INFO MemoryStore: Block broadcast_0 stored as values in
memory (estimated size 473.2 KB, free 529.8 MB)
15/10/08 11:15:37 INFO MemoryStore: ensureFreeSpace(45559) called with
curMem=484576, maxMem=556038881
15/10/08 11:15:37 INFO MemoryStore: Block broadcast_0_piece0 stored as
bytes in memory (estimated size 44.5 KB, free 529.8 MB)
15/10/08 11:15:37 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
on localhost:59594 (size: 44.5 KB, free: 530.2 MB)
15/10/08 11:15:37 INFO SparkContext: Created broadcast 0 from collect at
:3
15/10/08 11:15:37 INFO FileInputFormat: Total input paths to process : 4
15/10/08 11:15:37 INFO SparkContext: Starting job: collect at :3
15/10/08 11:15:37 INFO DAGScheduler: Got job 0 (collect at :3) with
4 output partitions
15/10/08 11:15:37 INFO DAGScheduler: Final stage: ResultStage 0(collect at
:3)
15/10/08 11:15:37 INFO DAGScheduler: Parents of final stage: List()
15/10/08 11:15:37 INFO DAGScheduler: Missing parents: List()
15/10/08 11:15:37 INFO DAGScheduler: Submitting ResultStage 0
(MapPartitionsRDD[4] at collect at :3), which has no missing parents
15/10/08 11:15:37 INFO MemoryStore: ensureFreeSpace(8896) called with
curMem=530135, maxMem=556038881
15/10/08 11:15:37 INFO MemoryStore: Block broadcast_1 stored as values in
memory (estimated size 8.7 KB, free 529.8 MB)
15/10/08 11:15:37 INFO MemoryStore: ensureFreeSpace(4679) called with
curMem=539031, maxMem=556038881
15/10/08 11:15:37 INFO MemoryStore: Block broadcast_1_piece0 stored as
bytes in memory (estimated size 4.6 KB, free 529.8 MB)
15/10/08 11:15:37 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory
on localhost:59594 (size: 4.6 KB, free: 530.2 MB)
15/10/08 11:15:37 INFO SparkContext: Created broadcast 1 from broadcast at
DAGScheduler.scala:861
15/10/08 11:15:37 INFO DAGScheduler: Submitting 4 missing tasks from
ResultStage 0 (MapPartitionsRDD[4] at collect at :3)
15/10/08 11:15:37 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
15/10/08 11:15:37 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
0, localhost, ANY, 2184 bytes)
15/10/08 11:15:37 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
1, localhost, ANY, 2184 bytes)
15/10/08 11:15:37 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID
2, localhost, ANY, 2184 bytes)
15/10/08 11:15:37 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID
3, localhost, ANY, 2184 bytes)
15/10/08 11:15:37 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/10/08 11:15:37 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
15/10/08 11:15:37 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
15/10/08 11:15:37 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/10/08 11:15:37 INFO HadoopRDD: Input split:
hdfs://:8020/user/hive/warehouse/inputcata/part-m-3:0+55353265
15/10/08 11:15:37 INFO HadoopRDD: Input split:
hdfs://:8020/user/hive/warehouse/inputcata/part-m-0:0+55149172
15/10/08 11:15:37 INFO HadoopRDD: Input split:
hdfs://:8020/user/hive/warehouse/inputcata/part-m-2:0+55418752
15/10/08 11:15:37 INFO HadoopRDD: Input split:
hdfs://:8020/user/hive/warehouse/inputcata/part-m-1:0+55083905
15/10/08 11:15:37 INFO deprecation: mapred.tip.id is deprecated. Instead,
use mapreduce.task.id
15/10/08 11:15:37 INFO deprecation: mapred.task.id is deprecated. Instead,
use mapreduce.task.attempt.id
15/10/08 11:15:37 INFO deprecation: mapred.task.is.map is deprecated.
Instead, use mapreduce.task.ismap
15/10/08 11:15:37 INFO deprecation: mapred.task.partition is deprecated.
Instead, use mapreduce.task.partition
15/10/08 11:15:37 INFO deprecation: mapred.job.id is deprecated. Instead,
use mapreduce.job.id
15/10/08 11:15:39 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1).
8919569 bytes result sent to driver
15/10/08 11:15:39 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3).
8919548 bytes result sent to driver
15/10/08 11:15:39 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2).
8926124 bytes result sent to driver
15/10/08 11:15:39 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID
1) in 2186 ms on localhost (1/4)
15/10/08 11:15:39 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID
3) in 2203 ms on localhost (2/4)
15/10/08 11:15:39 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID
2) in 2208 ms on localhost (3/4)
15/10/08 11:15:39 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0).
8794038 bytes result 

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Nicholas Chammas
Sounds good to me.

For my purposes, I'm less concerned about old Spark artifacts and more
concerned about the consistency of the set of artifacts that get generated
with new releases. (e.g. Each new release will always include one artifact
each for Hadoop 1, Hadoop 1 + Scala 2.11, etc...)

It sounds like we can expect that set to stay the same with new releases
for now, but it's not a hard guarantee. I think that's fine for now.

Nick

On Wed, Oct 7, 2015 at 1:57 PM Patrick Wendell  wrote:

> I don't think we have a firm contract around that. So far we've never
> removed old artifacts, but the ASF has asked us at times to decrease the
> size of binaries we post. In the future at some point we may drop older
> ones since we keep adding new ones.
>
> If downstream projects are depending on our artifacts, I'd say just hold
> tight for now until something changes. If it changes, then those projects
> might need to build Spark on their own and host older hadoop versions, etc.
>
> On Wed, Oct 7, 2015 at 9:59 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks guys.
>>
>> Regarding this earlier question:
>>
>> More importantly, is there some rough specification for what packages we
>> should be able to expect in this S3 bucket with every release?
>>
>> Is the implied answer that we should continue to expect the same set of
>> artifacts for every release for the foreseeable future?
>>
>> Nick
>> ​
>>
>> On Tue, Oct 6, 2015 at 1:13 AM Patrick Wendell 
>> wrote:
>>
>>> The missing artifacts are uploaded now. Things should propagate in the
>>> next 24 hours. If there are still issues past then ping this thread. Thanks!
>>>
>>> - Patrick
>>>
>>> On Mon, Oct 5, 2015 at 2:41 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 Thanks for looking into this Josh.

 On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen 
 wrote:

> I'm working on a fix for this right now. I'm planning to re-run a
> modified copy of the release packaging scripts which will emit only the
> missing artifacts (so we won't upload new artifacts with different SHAs 
> for
> the builds which *did* succeed).
>
> I expect to have this finished in the next day or so; I'm currently
> blocked by some infra downtime but expect that to be resolved soon.
>
> - Josh
>
> On Mon, Oct 5, 2015 at 8:46 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Blaž said:
>>
>> Also missing is
>> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
>> which breaks spark-ec2 script.
>>
>> This is the package I am referring to in my original email.
>>
>> Nick said:
>>
>> It appears that almost every version of Spark up to and including
>> 1.5.0 has included a —bin-hadoop1.tgz release (e.g.
>> spark-1.5.0-bin-hadoop1.tgz). However, 1.5.1 has no such package.
>>
>> Nick
>> ​
>>
>> On Mon, Oct 5, 2015 at 3:27 AM Blaž Šnuderl 
>> wrote:
>>
>>> Also missing is
>>> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
>>> which breaks spark-ec2 script.
>>>
>>> On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu  wrote:
>>>
 hadoop1 package for Scala 2.10 wasn't in RC1 either:

 http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/

 On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> I’m looking here:
>
> https://s3.amazonaws.com/spark-related-packages/
>
> I believe this is where one set of official packages is published.
> Please correct me if this is not the case.
>
> It appears that almost every version of Spark up to and including
> 1.5.0 has included a --bin-hadoop1.tgz release (e.g.
> spark-1.5.0-bin-hadoop1.tgz).
>
> However, 1.5.1 has no such package. There is a
> spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a
> separate thing. (1.5.0 also has a hadoop1-scala2.11 package.)
>
> Was this intentional?
>
> More importantly, is there some rough specification for what
> packages we should be able to expect in this S3 bucket with every 
> release?
>
> This is important for those of us who depend on this publishing
> venue (e.g. spark-ec2 and related tools).
>
> Nick
> ​
>


>>>
>
>>>
>


Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Sean Owen
This is about the s3.amazonaws.com files, not dist.apache.org, right?
Or does it affect both?

(BTW you can keep as many old release artifacts around on the
apache.org archives as you like; I think the suggestion is to remove
all but the most recent releases from the set that's replicated to all
the Apache mirrors.)

On Wed, Oct 7, 2015 at 6:57 PM, Patrick Wendell  wrote:
> I don't think we have a firm contract around that. So far we've never
> removed old artifacts, but the ASF has asked us at times to decrease the size
> of binaries we post. In the future at some point we may drop older ones
> since we keep adding new ones.
>
> If downstream projects are depending on our artifacts, I'd say just hold
> tight for now until something changes. If it changes, then those projects
> might need to build Spark on their own and host older hadoop versions, etc.
>
> On Wed, Oct 7, 2015 at 9:59 AM, Nicholas Chammas
>  wrote:
>>
>> Thanks guys.
>>
>> Regarding this earlier question:
>>
>> More importantly, is there some rough specification for what packages we
>> should be able to expect in this S3 bucket with every release?
>>
>> Is the implied answer that we should continue to expect the same set of
>> artifacts for every release for the foreseeable future?
>>
>> Nick
>>
>>
>> On Tue, Oct 6, 2015 at 1:13 AM Patrick Wendell  wrote:
>>>
>>> The missing artifacts are uploaded now. Things should propagate in the
>>> next 24 hours. If there are still issues past then ping this thread. Thanks!
>>>
>>> - Patrick
>>>
>>> On Mon, Oct 5, 2015 at 2:41 PM, Nicholas Chammas
>>>  wrote:

 Thanks for looking into this Josh.

 On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen 
 wrote:
>
> I'm working on a fix for this right now. I'm planning to re-run a
> modified copy of the release packaging scripts which will emit only the
> missing artifacts (so we won't upload new artifacts with different SHAs 
> for
> the builds which did succeed).
>
> I expect to have this finished in the next day or so; I'm currently
> blocked by some infra downtime but expect that to be resolved soon.
>
> - Josh
>
> On Mon, Oct 5, 2015 at 8:46 AM, Nicholas Chammas
>  wrote:
>>
>> Blaž said:
>>
>> Also missing is
>> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
>> which breaks spark-ec2 script.
>>
>> This is the package I am referring to in my original email.
>>
>> Nick said:
>>
>> It appears that almost every version of Spark up to and including
>> 1.5.0 has included a —bin-hadoop1.tgz release (e.g.
>> spark-1.5.0-bin-hadoop1.tgz). However, 1.5.1 has no such package.
>>
>> Nick
>>
>>
>> On Mon, Oct 5, 2015 at 3:27 AM Blaž Šnuderl  wrote:
>>>
>>> Also missing is
>>> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
>>> which breaks spark-ec2 script.
>>>
>>> On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu  wrote:

 hadoop1 package for Scala 2.10 wasn't in RC1 either:

 http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/

 On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas
  wrote:
>
> I’m looking here:
>
> https://s3.amazonaws.com/spark-related-packages/
>
> I believe this is where one set of official packages is published.
> Please correct me if this is not the case.
>
> It appears that almost every version of Spark up to and including
> 1.5.0 has included a --bin-hadoop1.tgz release (e.g.
> spark-1.5.0-bin-hadoop1.tgz).
>
> However, 1.5.1 has no such package. There is a
> spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a separate 
> thing.
> (1.5.0 also has a hadoop1-scala2.11 package.)
>
> Was this intentional?
>
> More importantly, is there some rough specification for what
> packages we should be able to expect in this S3 bucket with every 
> release?
>
> This is important for those of us who depend on this publishing
> venue (e.g. spark-ec2 and related tools).
>
> Nick


>>>
>
>>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Patrick Wendell
I don't think we have a firm contract around that. So far we've never
removed old artifacts, but the ASF has asked us at times to decrease the
size of binaries we post. In the future at some point we may drop older
ones since we keep adding new ones.

If downstream projects are depending on our artifacts, I'd say just hold
tight for now until something changes. If it changes, then those projects
might need to build Spark on their own and host older hadoop versions, etc.

On Wed, Oct 7, 2015 at 9:59 AM, Nicholas Chammas  wrote:

> Thanks guys.
>
> Regarding this earlier question:
>
> More importantly, is there some rough specification for what packages we
> should be able to expect in this S3 bucket with every release?
>
> Is the implied answer that we should continue to expect the same set of
> artifacts for every release for the foreseeable future?
>
> Nick
> ​
>
> On Tue, Oct 6, 2015 at 1:13 AM Patrick Wendell  wrote:
>
>> The missing artifacts are uploaded now. Things should propagate in the
>> next 24 hours. If there are still issues past then ping this thread. Thanks!
>>
>> - Patrick
>>
>> On Mon, Oct 5, 2015 at 2:41 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Thanks for looking into this Josh.
>>>
>>> On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen 
>>> wrote:
>>>
 I'm working on a fix for this right now. I'm planning to re-run a
 modified copy of the release packaging scripts which will emit only the
 missing artifacts (so we won't upload new artifacts with different SHAs for
 the builds which *did* succeed).

 I expect to have this finished in the next day or so; I'm currently
 blocked by some infra downtime but expect that to be resolved soon.

 - Josh

 On Mon, Oct 5, 2015 at 8:46 AM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> Blaž said:
>
> Also missing is
> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
> which breaks spark-ec2 script.
>
> This is the package I am referring to in my original email.
>
> Nick said:
>
> It appears that almost every version of Spark up to and including
> 1.5.0 has included a —bin-hadoop1.tgz release (e.g.
> spark-1.5.0-bin-hadoop1.tgz). However, 1.5.1 has no such package.
>
> Nick
> ​
>
> On Mon, Oct 5, 2015 at 3:27 AM Blaž Šnuderl  wrote:
>
>> Also missing is http://s3.amazonaws.com/spark-related-packages/spark-
>> 1.5.1-bin-hadoop1.tgz which breaks spark-ec2 script.
>>
>> On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu  wrote:
>>
>>> hadoop1 package for Scala 2.10 wasn't in RC1 either:
>>>
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>>
>>> On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 I’m looking here:

 https://s3.amazonaws.com/spark-related-packages/

 I believe this is where one set of official packages is published.
 Please correct me if this is not the case.

 It appears that almost every version of Spark up to and including
 1.5.0 has included a --bin-hadoop1.tgz release (e.g.
 spark-1.5.0-bin-hadoop1.tgz).

 However, 1.5.1 has no such package. There is a
 spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a
 separate thing. (1.5.0 also has a hadoop1-scala2.11 package.)

 Was this intentional?

 More importantly, is there some rough specification for what
 packages we should be able to expect in this S3 bucket with every 
 release?

 This is important for those of us who depend on this publishing
 venue (e.g. spark-ec2 and related tools).

 Nick
 ​

>>>
>>>
>>

>>


Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Nicholas Chammas
Thanks guys.

Regarding this earlier question:

More importantly, is there some rough specification for what packages we
should be able to expect in this S3 bucket with every release?

Is the implied answer that we should continue to expect the same set of
artifacts for every release for the foreseeable future?

Nick
​

On Tue, Oct 6, 2015 at 1:13 AM Patrick Wendell  wrote:

> The missing artifacts are uploaded now. Things should propagate in the
> next 24 hours. If there are still issues past then ping this thread. Thanks!
>
> - Patrick
>
> On Mon, Oct 5, 2015 at 2:41 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks for looking into this Josh.
>>
>> On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen 
>> wrote:
>>
>>> I'm working on a fix for this right now. I'm planning to re-run a
>>> modified copy of the release packaging scripts which will emit only the
>>> missing artifacts (so we won't upload new artifacts with different SHAs for
>>> the builds which *did* succeed).
>>>
>>> I expect to have this finished in the next day or so; I'm currently
>>> blocked by some infra downtime but expect that to be resolved soon.
>>>
>>> - Josh
>>>
>>> On Mon, Oct 5, 2015 at 8:46 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 Blaž said:

 Also missing is
 http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
 which breaks spark-ec2 script.

 This is the package I am referring to in my original email.

 Nick said:

 It appears that almost every version of Spark up to and including 1.5.0
 has included a —bin-hadoop1.tgz release (e.g. spark-1.5.0-bin-hadoop1.tgz).
 However, 1.5.1 has no such package.

 Nick
 ​

 On Mon, Oct 5, 2015 at 3:27 AM Blaž Šnuderl  wrote:

> Also missing is http://s3.amazonaws.com/spark-related-packages/spark-
> 1.5.1-bin-hadoop1.tgz which breaks spark-ec2 script.
>
> On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu  wrote:
>
>> hadoop1 package for Scala 2.10 wasn't in RC1 either:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>
>> On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I’m looking here:
>>>
>>> https://s3.amazonaws.com/spark-related-packages/
>>>
>>> I believe this is where one set of official packages is published.
>>> Please correct me if this is not the case.
>>>
>>> It appears that almost every version of Spark up to and including
>>> 1.5.0 has included a --bin-hadoop1.tgz release (e.g.
>>> spark-1.5.0-bin-hadoop1.tgz).
>>>
>>> However, 1.5.1 has no such package. There is a
>>> spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a
>>> separate thing. (1.5.0 also has a hadoop1-scala2.11 package.)
>>>
>>> Was this intentional?
>>>
>>> More importantly, is there some rough specification for what
>>> packages we should be able to expect in this S3 bucket with every 
>>> release?
>>>
>>> This is important for those of us who depend on this publishing
>>> venue (e.g. spark-ec2 and related tools).
>>>
>>> Nick
>>> ​
>>>
>>
>>
>
>>>
>


Re: Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-10-07 Thread Michael Armbrust
Please do.

On Wed, Oct 7, 2015 at 9:49 AM, Russell Spitzer 
wrote:

> Should I make up a new ticket for this? Or is there something already
> underway?
>
> On Mon, Oct 5, 2015 at 4:31 PM Russell Spitzer 
> wrote:
>
>> That sounds fine to me, we already do the filtering so populating that
>> field would be pretty simple.
>>
>> On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust 
>> wrote:
>>
>>> We have to try and maintain binary compatibility here, so probably the
>>> easiest thing to do here would be to add a method to the class.  Perhaps
>>> something like:
>>>
>>> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
>>>
>>> By default, this could return all filters so behavior would remain the
>>> same, but specific implementations could override it.  There is still a
>>> chance that this would conflict with existing methods, but hopefully that
>>> would not be a problem in practice.
>>>
>>> Thoughts?
>>>
>>> Michael
>>>
>>> On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
 Hi! First time poster, long time reader.

 I'm wondering if there is a way to let Catalyst know that it doesn't
 need to repeat a filter on the Spark side after a filter has been applied
 by the source implementing PrunedFilteredScan.


 This is for a use case in which we accept a filter on a non-existent
 column that serves as an entry point for our integration with a different
 system. While the source can correctly deal with this, the secondary filter
 done on the RDD itself wipes out the results because the column being
 filtered does not exist.

 In particular, this is with our integration with Solr, where we allow
 users to pass in a predicate based on "solr_query" (e.g. "where
 solr_query='*:*'"). There is no column "solr_query", so the rdd.filter(
 row.solr_query == '*:*') filters out all of the data since no rows will
 have that column.

 I'm thinking about a few solutions to this but they all seem a little
 hacky
 1) Try to manually remove the filter step from the query plan after our
 source handles the filter
 2) Populate the solr_query field being returned so they all
 automatically pass

 But I think the real solution is to add a way to create a
 PrunedFilteredScan which does not reapply filters if the source doesn't want
 it to, i.e. giving PrunedFilteredScan the ability to trust the underlying
 source that the filter will be accurately applied. Maybe changing the API
 to

 PrunedFilteredScan(requiredColumns: Array[String], filters:
 Array[Filter], reapply: Boolean = true)

 Where Catalyst can check the reapply value and not add an RDD.filter if
 it is false.

 Thoughts?

 Thanks for your time,
 Russ

>>>
>>>
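
(For concreteness, a hedged sketch of how a source such as the Solr integration
discussed above might use the proposed hook; SolrRelation here is hypothetical,
and returning the filters unchanged is what would preserve the current behavior
for other sources:)

  import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}

  // Hypothetical relation declaring that any "solr_query" equality predicate is
  // fully handled by the source, so Catalyst would not re-apply it on the Spark side.
  abstract class SolrRelation extends BaseRelation with PrunedFilteredScan {

    // Proposed method; returning `filters` unchanged keeps today's behavior.
    def unhandledFilters(filters: Array[Filter]): Array[Filter] =
      filters.filterNot {
        case EqualTo("solr_query", _) => true   // pushed down entirely to Solr
        case _                        => false
      }
  }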


Spark standalone hangup during shuffle flatMap or explode in cluster

2015-10-07 Thread Saif.A.Ellafi
When running a standalone cluster-mode job, the process randomly hangs during
a DataFrame flatMap or explode operation, in HiveContext:

-->> df.flatMap(r => for (n <- 1 to r.getInt(ind)) yield r)

This does not happen with SQLContext in cluster mode, or with Hive/SQL in local
mode, where it works fine.

A couple of minutes after the hangup, executors start dropping. I am attaching
the logs.

Saif




15/10/07 12:15:19 INFO TaskSetManager: Finished task 50.0 in stage 17.0 (TID 
166) in 2511 ms on 162.101.194.47 (180/200)
15/10/07 12:15:19 INFO TaskSetManager: Finished task 66.0 in stage 17.0 (TID 
182) in 2510 ms on 162.101.194.47 (181/200)
15/10/07 12:15:19 INFO TaskSetManager: Finished task 110.0 in stage 17.0 (TID 
226) in 2505 ms on 162.101.194.47 (182/200)
15/10/07 12:15:19 INFO TaskSetManager: Finished task 74.0 in stage 17.0 (TID 
190) in 2530 ms on 162.101.194.47 (183/200)
15/10/07 12:15:19 INFO TaskSetManager: Finished task 106.0 in stage 17.0 (TID 
222) in 2530 ms on 162.101.194.47 (184/200)
15/10/07 12:20:01 WARN HeartbeatReceiver: Removing executor 2 with no recent 
heartbeats: 141447 ms exceeds timeout 12 ms
15/10/07 12:20:01 ERROR TaskSchedulerImpl: Lost executor 2 on 162.101.194.44: 
Executor heartbeat timed out after 141447 ms
15/10/07 12:20:01 INFO TaskSetManager: Re-queueing tasks for 2 from TaskSet 17.0
15/10/07 12:20:01 WARN TaskSetManager: Lost task 113.0 in stage 17.0 (TID 229, 
162.101.194.44): ExecutorLostFailure (executor 2 lost)
15/10/07 12:20:01 WARN TaskSetManager: Lost task 73.0 in stage 17.0 (TID 189, 
162.101.194.44): ExecutorLostFailure (executor 2 lost)
15/10/07 12:20:01 WARN TaskSetManager: Lost task 81.0 in stage 17.0 (TID 197, 
162.101.194.44): ExecutorLostFailure (executor 2 lost)
15/10/07 12:20:01 INFO TaskSetManager: Starting task 81.1 in stage 17.0 (TID 
316, 162.101.194.45, PROCESS_LOCAL, 2045 bytes)
15/10/07 12:20:01 INFO TaskSetManager: Starting task 73.1 in stage 17.0 (TID 
317, 162.101.194.44, PROCESS_LOCAL, 2045 bytes)
15/10/07 12:20:01 INFO TaskSetManager: Starting task 113.1 in stage 17.0 (TID 
318, 162.101.194.48, PROCESS_LOCAL, 2045 bytes)
15/10/07 12:20:01 INFO SparkDeploySchedulerBackend: Requesting to kill 
executor(s) 2
15/10/07 12:20:01 INFO DAGScheduler: Executor lost: 2 (epoch 4)
15/10/07 12:20:01 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 
from BlockManagerMaster.
15/10/07 12:20:01 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(2, 162.101.194.44, 42537)
15/10/07 12:20:01 INFO BlockManagerMaster: Removed 2 successfully in 
removeExecutor
15/10/07 12:20:01 INFO ShuffleMapStage: ShuffleMapStage 15 is now unavailable 
on executor 2 (1/2, false)
15/10/07 12:20:01 INFO ShuffleMapStage: ShuffleMapStage 16 is now unavailable 
on executor 2 (8/16, false)
15/10/07 12:20:01 INFO DAGScheduler: Host added was in lost list earlier: 
162.101.194.44
15/10/07 12:20:01 INFO AppClient$ClientEndpoint: Executor added: 
app-20151007121501-0022/69 on worker-20151007063932-162.101.194.44-57091 
(162.101.194.44:57091) with 32 cores
15/10/07 12:20:01 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20151007121501-0022/69 on hostPort 162.101.194.44:57091 with 32 cores, 
100.0 GB RAM
15/10/07 12:20:01 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151007121501-0022/69 is now RUNNING
15/10/07 12:20:01 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151007121501-0022/69 is now LOADING
15/10/07 12:20:01 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151007121501-0022/69 is now EXITED (Command exited with code 1)
15/10/07 12:20:01 INFO SparkDeploySchedulerBackend: Executor 
app-20151007121501-0022/69 removed: Command exited with code 1
15/10/07 12:20:01 INFO SparkDeploySchedulerBackend: Asked to remove 
non-existent executor 69
15/10/07 12:20:01 INFO AppClient$ClientEndpoint: Executor added: 
app-20151007121501-0022/70 on worker-20151007063932-162.101.194.44-57091 
(162.101.194.44:57091) with 32 cores
15/10/07 12:20:01 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20151007121501-0022/70 on hostPort 162.101.194.44:57091 with 32 cores, 
100.0 GB RAM
15/10/07 12:20:01 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151007121501-0022/70 is now RUNNING
15/10/07 12:20:01 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151007121501-0022/70 is now LOADING
15/10/07 12:20:01 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151007121501-0022/70 is now EXITED (Command exited with code 1)
15/10/07 12:20:01 INFO SparkDeploySchedulerBackend: Executor 
app-20151007121501-0022/70 removed: Command exited with code 1
15/10/07 12:20:01 INFO SparkDeploySchedulerBackend: Asked to remove 
non-existent executor 70
15/10/07 12:20:01 INFO AppClient$ClientEndpoint: Executor added: 
app-20151007121501-0022/71 on worker-20151007063932-162.101.194.44-57091 
(162.101.194.44:57091) with 32 cores
15/10/07 12:20:01 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20151007121501-0022/71 on 

Re: Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-10-07 Thread Russell Spitzer
Should I make up a new ticket for this? Or is there something already
underway?

On Mon, Oct 5, 2015 at 4:31 PM Russell Spitzer 
wrote:

> That sounds fine to me, we already do the filtering so populating that
> field would be pretty simple.
>
> On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust 
> wrote:
>
>> We have to try and maintain binary compatibility here, so probably the
>> easiest thing to do here would be to add a method to the class.  Perhaps
>> something like:
>>
>> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
>>
>> By default, this could return all filters so behavior would remain the
>> same, but specific implementations could override it.  There is still a
>> chance that this would conflict with existing methods, but hopefully that
>> would not be a problem in practice.
>>
>> Thoughts?
>>
>> Michael
>>
>> On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> Hi! First time poster, long time reader.
>>>
>>> I'm wondering if there is a way to let Catalyst know that it doesn't
>>> need to repeat a filter on the Spark side after a filter has been applied
>>> by the source implementing PrunedFilteredScan.
>>>
>>>
>>> This is for a use case in which we accept a filter on a non-existent
>>> column that serves as an entry point for our integration with a different
>>> system. While the source can correctly deal with this, the secondary filter
>>> done on the RDD itself wipes out the results because the column being
>>> filtered does not exist.
>>>
>>> In particular, this is with our integration with Solr, where we allow
>>> users to pass in a predicate based on "solr_query" (e.g. "where
>>> solr_query='*:*'"). There is no column "solr_query", so the rdd.filter(
>>> row.solr_query == '*:*') filters out all of the data since no rows will
>>> have that column.
>>>
>>> I'm thinking about a few solutions to this but they all seem a little
>>> hacky
>>> 1) Try to manually remove the filter step from the query plan after our
>>> source handles the filter
>>> 2) Populate the solr_query field being returned so they all
>>> automatically pass
>>>
>>> But I think the real solution is to add a way to create a
>>> PrunedFilteredScan which does not reapply filters if the source doesn't want
>>> it to, i.e. giving PrunedFilteredScan the ability to trust the underlying
>>> source that the filter will be accurately applied. Maybe changing the API
>>> to
>>>
>>> PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter],
>>> reapply: Boolean = true)
>>>
>>> Where Catalyst can check the reapply value and not add an RDD.filter if
>>> it is false.
>>>
>>> Thoughts?
>>>
>>> Thanks for your time,
>>> Russ
>>>
>>
>>


Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-07 Thread Joseph Bradley
Hi YiZhi Liu,

The spark.ml classes are part of the higher-level "Pipelines" API, which
works with DataFrames.  When creating this API, we decided to separate it
from the old API to avoid confusion.  You can read more about it here:
http://spark.apache.org/docs/latest/ml-guide.html
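
(A hedged, side-by-side sketch of the two entry points YiZhi mentions; the
training data and parameter values are placeholders, not recommendations:)

  import org.apache.spark.ml.classification.{LogisticRegression => MLLogisticRegression}
  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.DataFrame

  // spark.ml: DataFrame in ("label"/"features" columns), fitted model out;
  // designed to compose with other stages in a Pipeline.
  def trainMl(training: DataFrame) =
    new MLLogisticRegression()
      .setMaxIter(100)
      .setRegParam(0.01)
      .fit(training)

  // spark.mllib: RDD[LabeledPoint] in, LogisticRegressionModel out.
  def trainMllib(training: RDD[LabeledPoint]) =
    new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(training)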

For (3): We use Breeze, but we have to modify it in order to do distributed
optimization based on Spark.

Joseph

On Tue, Oct 6, 2015 at 11:47 PM, YiZhi Liu  wrote:

> Hi everyone,
>
> I'm curious about the difference between
> ml.classification.LogisticRegression and
> mllib.classification.LogisticRegressionWithLBFGS. Both of them are
> optimized using LBFGS; the only difference I see is that LogisticRegression
> takes a DataFrame while LogisticRegressionWithLBFGS takes an RDD.
>
> So I wonder,
> 1. Why not simply add a DataFrame training interface to
> LogisticRegressionWithLBFGS?
> 2. What's the difference between the ml.classification and
> mllib.classification packages?
> 3. Why doesn't ml.classification.LogisticRegression call
> mllib.optimization.LBFGS / mllib.optimization.OWLQN directly? Instead,
> it uses breeze.optimize.LBFGS and re-implements most of the procedures
> in mllib.optimization.{LBFGS,OWLQN}.
>
> Thank you.
>
> Best,
>
> --
> Yizhi Liu
> Senior Software Engineer / Data Mining
> www.mvad.com, Shanghai, China
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>