Re: Hadoop LR comparison

2014-04-01 Thread DB Tsai
Hi Li-Ming,

This binary logistic regression using SGD is in
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala

We're working on multinomial logistic regression with Newton and L-BFGS
optimizers now. It will be released soon.
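
For reference, a minimal sketch of calling the SGD-based trainer linked above, assuming
the MLlib API of that era (LogisticRegressionWithSGD, LabeledPoint with Array[Double]
features) and a hypothetical input path; not taken from the original message:

  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
  import org.apache.spark.mllib.regression.LabeledPoint

  // Assumes an existing SparkContext `sc` and lines of the form "label,f1 f2 f3 ...".
  val points = sc.textFile("hdfs:///path/to/lr_data.txt").map { line =>
    val parts = line.split(',')
    LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(_.toDouble))
  }
  val model = LogisticRegressionWithSGD.train(points, 100)  // 100 SGD iterations
  val weights = model.weights                               // the fitted hyperplane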


Sincerely,

DB Tsai
Machine Learning Engineer
Alpine Data Labs
--
Web: http://alpinenow.com/


On Mon, Mar 31, 2014 at 11:38 PM, Tsai Li Ming mailingl...@ltsai.com wrote:

 Hi,

 Is the code available for Hadoop to calculate the Logistic Regression
 hyperplane?

 I’m looking at the Examples:
 http://spark.apache.org/examples.html,

 where there is the 110s vs 0.9s in Hadoop vs Spark comparison.

 Thanks!


Re: Hadoop LR comparison

2014-04-01 Thread Tsai Li Ming
Thanks.

What would be the equivalent Hadoop code behind the published 110s/0.9s 
comparison?


On 1 Apr, 2014, at 2:44 pm, DB Tsai dbt...@alpinenow.com wrote:

 Hi Li-Ming,
 
 This binary logistic regression using SGD is in 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 
 We're working on multinomial logistic regression with Newton and L-BFGS 
 optimizers now. It will be released soon.
 
 
 Sincerely,
 
 DB Tsai
 Machine Learning Engineer
 Alpine Data Labs
 --
 Web: http://alpinenow.com/
 
 
 On Mon, Mar 31, 2014 at 11:38 PM, Tsai Li Ming mailingl...@ltsai.com wrote:
 Hi,
 
 Is the code available for Hadoop to calculate the Logistic Regression 
 hyperplane?
 
 I’m looking at the Examples:
 http://spark.apache.org/examples.html,
 
 where there is the 110s vs 0.9s in Hadoop vs Spark comparison.
 
 Thanks!
 



Re: Configuring distributed caching with Spark and YARN

2014-04-01 Thread santhoma
I think with addJar() there is no 'caching', in the sense that files will be
copied every time per job.
Whereas with the Hadoop distributed cache, files are copied only once, and a
symlink is created to the cached file for subsequent runs:
https://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/filecache/DistributedCache.html

Also, the Hadoop distributed cache can copy an archive file to the node and
unzip it automatically into the current working dir. The advantage here is that
the copying is very fast.

Still looking for a similar mechanism in Spark.
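
For comparison, a rough sketch of the closest built-in mechanism (not from the original
post; the master URL and file paths are hypothetical): SparkContext.addFile() ships a
file to every worker once per application, and SparkFiles.get() resolves its local path
inside tasks. It is per-application shipping rather than a cross-job cache.

  import org.apache.spark.{SparkContext, SparkFiles}

  val sc = new SparkContext("spark://master:7077", "cache-example")
  sc.addFile("hdfs:///shared/lookup.txt")          // copied to each worker once

  val hits = sc.textFile("hdfs:///input/events.txt").mapPartitions { iter =>
    val localPath = SparkFiles.get("lookup.txt")   // local copy on the worker
    val lookup = scala.io.Source.fromFile(localPath).getLines().toSet
    iter.filter(lookup.contains)
  }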




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-distributed-caching-with-Spark-and-YARN-tp1074p3566.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
 Spark now shades its own protobuf dependency so protobuf 2.4.1 shouldn't be 
 getting pulled in unless you are directly using akka yourself. Are you?

No, I'm not. Although I see that protobuf libraries are directly pulled into the 
0.9.0 assembly jar - I do see the shaded version as well. 
e.g. below for Message.class

-bash-4.1$ jar -ftv 
./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.0-cdh4.2.1.jar
 | grep protobuf | grep /Message.class
   478 Thu Jun 30 15:26:12 PDT 2011 com/google/protobuf/Message.class
   508 Sat Dec 14 14:20:38 PST 2013 com/google/protobuf_spark/Message.class


 Does your project have other dependencies that might be indirectly pulling in 
 protobuf 2.4.1? It would be helpful if you could list all of your 
 dependencies including the exact Spark version and other libraries.

I did have another one, which I moved to the end of the classpath - I even ran partial 
code without that dependency, but it still failed whenever I used the jar with the 
ScalaBuff dependency. 
The Spark version is 0.9.0.


~Vipul
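
A quick hedged aside (not from the thread): one way to confirm which protobuf jar a
running job has actually loaded, which can help when a VerifyError suggests the 2.4.1
classes won out over 2.5:

  // prints the jar (or directory) the protobuf Message class was loaded from
  val codeSource = classOf[com.google.protobuf.Message].getProtectionDomain.getCodeSource
  println("protobuf Message loaded from: " + Option(codeSource).map(_.getLocation).orNull)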

On Mar 31, 2014, at 4:51 PM, Patrick Wendell pwend...@gmail.com wrote:

 Spark now shades its own protobuf dependency so protobuf 2.4.1 should't be 
 getting pulled in unless you are directly using akka yourself. Are you?
 
 Does your project have other dependencies that might be indirectly pulling in 
 protobuf 2.4.1? It would be helpful if you could list all of your 
 dependencies including the exact Spark version and other libraries.
 
 - Patrick
 
 
 On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey vipan...@gmail.com wrote:
I'm using ScalaBuff (which depends on protobuf 2.5) and facing the same issue. 
Any word on this one?
 On Mar 27, 2014, at 6:41 PM, Kanwaldeep kanwal...@gmail.com wrote:
 
  We are using Protocol Buffer 2.5 to send messages to Spark Streaming 0.9 with a
  Kafka stream setup. I have Protocol Buffer 2.5 as part of the uber jar deployed
  on each of the Spark worker nodes.
  The message is compiled using 2.5, but at runtime it is being
  de-serialized by 2.4.1, as I'm getting the following exception:
 
  java.lang.VerifyError (java.lang.VerifyError: class
  com.snc.sinet.messages.XServerMessage$XServer overrides final method
  getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;)
  java.lang.ClassLoader.defineClass1(Native Method)
  java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
  java.lang.ClassLoader.defineClass(ClassLoader.java:615)
  java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
 
  Any suggestions on how I could still use ProtoBuf 2.5? Based on the ticket
  https://spark-project.atlassian.net/browse/SPARK-995 we should be able to
  use a different version of protobuf in the application.
 
 
 
 
 
  --
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 



SSH problem

2014-04-01 Thread Sai Prasanna
Hi All,

I have a five node spark cluster, Master, s1,s2,s3,s4.

I have passwordless ssh to all slaves from the master and vice versa.
But with one machine, s2, what happens is that after 2-3 minutes of my
connection from master to slave, the write pipe is broken. So if I try to
connect again from the master I get the error:
ssh: connect to host s1 port 22: Connection refused.
But I can log in from s2 to the master, and after doing that for the next 2-3
minutes I get access from the master to s2 again.

What is this strange behaviour?
I have set complete access in hosts.allow as well!


Sliding Subwindows

2014-04-01 Thread aecc
Hello, I would like to have a kind of sub-window. The idea is to have 3
windows laid out along the time axis in the following way:

  future  |-- w1 --|-------- w2 --------|-- w3 --|  past

So I can do some processing on the new data coming in (w1) to the main
window (w2), and some processing on the data leaving the window (w3).

Any ideas on how I can do this in Spark? Is there a way to create sub-windows,
or to specify when a window should start reading?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Sliding-Subwindows-tp3572.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


foreach not working

2014-04-01 Thread eric perler
Hello, I am on my second day with Spark, and I'm having trouble getting the foreach 
function to work with the network wordcount example. I can see that the 
flatMap and map methods are being invoked, but I don't seem to be getting 
into the foreach method. Not sure if what I am doing even makes sense - any 
help is appreciated. Thanks!
JavaDStream<String> lines = ssc.socketTextStream("localhost", 1234);

JavaDStream<String> words = lines
    .flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String x) {
            // <-- this is being invoked
            return Lists.newArrayList(x.split(" "));
        }
    });

JavaPairDStream<String, Integer> wordCounts = words
    .map(new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) throws Exception {
            // <-- this is being invoked
            return new Tuple2<String, Integer>(s, 1);
        }
    });

wordCounts.foreach(collectTuplesFunc);

ssc.start();
ssc.awaitTermination();
}

Function collectTuplesFunc = new Function<JavaPairRDD<Tuple2<byte[], byte[]>, Void>, Void>() {
    @Override
    public Void call(JavaPairRDD<Tuple2<byte[], byte[]>, Void> arg0) throws Exception {
        // <-- this is NOT being invoked
        return null;
    }
};
I am assuming that the foreach call is where I would write to an external 
system - please correct me if this assumption is wrong.
Thanks again.

Re: Unable to submit an application to standalone cluster which on hdfs.

2014-04-01 Thread haikal.pribadi
How do you remove the validation blocker from the compilation?

Thank you



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-submit-an-application-to-standalone-cluster-which-on-hdfs-tp1730p3574.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


custom receiver in java

2014-04-01 Thread eric perler
I would like to write a custom receiver to receive data from a Tibco RV subject.
I found this Scala example:
http://spark.incubator.apache.org/docs/0.8.0/streaming-custom-receivers.html
but I can't seem to find a Java example.
Does anybody know of a good Java example for creating a custom receiver?
Thanks.

Use combineByKey and StatCount

2014-04-01 Thread Jaonary Rabarisoa
Hi all,

Can someone give me some tips on computing the mean of an RDD by key, maybe with
combineByKey and StatCount?

Cheers,

Jaonary
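
A minimal sketch of one way to do this, assuming "StatCount" refers to Spark's
org.apache.spark.util.StatCounter (this is an illustration, not from the thread):

  import org.apache.spark.SparkContext._      // pair-RDD functions (0.9-era import)
  import org.apache.spark.rdd.RDD
  import org.apache.spark.util.StatCounter
  import scala.reflect.ClassTag

  def meanByKey[K: ClassTag](pairs: RDD[(K, Double)]): RDD[(K, Double)] =
    pairs.combineByKey(
      (v: Double) => new StatCounter().merge(v),        // create a combiner per key
      (acc: StatCounter, v: Double) => acc.merge(v),    // fold a value into a combiner
      (a: StatCounter, b: StatCounter) => a.merge(b)    // merge combiners across partitions
    ).mapValues(_.mean)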


Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Mark Hamstra
Some related discussion: https://github.com/apache/spark/pull/246


On Tue, Apr 1, 2014 at 8:43 AM, Philip Ogren philip.og...@oracle.com wrote:

 Hi DB,

 Just wondering if you ever got an answer to your question about monitoring
 progress - either offline or through your own investigation.  Any findings
 would be appreciated.

 Thanks,
 Philip


 On 01/30/2014 10:32 PM, DB Tsai wrote:

 Hi guys,

 When we're running a very long job, we would like to show users the
 current progress of map and reduce job. After looking at the api document,
 I don't find anything for this. However, in Spark UI, I could see the
 progress of the task. Is there anything I miss?

 Thanks.

 Sincerely,

 DB Tsai
 Machine Learning Engineer
 Alpine Data Labs
 --
 Web: http://alpinenow.com/





Re: Mllib in pyspark for 0.8.1

2014-04-01 Thread Matei Zaharia
You could probably port it back, but it required some changes on the Java side 
as well (a new PythonMLUtils class). It might be easier to fix the Mesos issues 
with 0.9.

Matei

On Apr 1, 2014, at 8:53 AM, Ian Ferreira ianferre...@hotmail.com wrote:

 
 Hi there,
 
 For some reason the distribution and build for 0.8.1 do not include
 the MLlib libraries for pyspark, i.e. importing from mllib fails.
 
 Seems to be addressed in 0.9.0, but that has other issues running on
 Mesos in standalone mode :)
 
 Any pointers?
 
 Cheers
 - Ian
 



Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Kanwaldeep
Yes, I'm using akka as well. But if that were the problem then I should have
been facing this issue in my local setup too. I'm only running into this
error when using the Spark standalone cluster.

But will try out your suggestion and let you know.

Thanks
Kanwal



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396p3582.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Kevin Markey

  
  
The discussion there hits on the distinction of jobs and stages.
When looking at one application, there are hundreds of stages,
sometimes thousands. Depends on the data and the task. And the UI
seems to track stages. And one could independently track them for
such a job. But what if -- as occurs in another application --
there's only one or two stages, but lots of data passing through
those 1 or 2 stages?

Kevin Markey


On 04/01/2014 09:55 AM, Mark Hamstra wrote:

 Some related discussion: https://github.com/apache/spark/pull/246

 On Tue, Apr 1, 2014 at 8:43 AM, Philip Ogren philip.og...@oracle.com wrote:

  Hi DB,

  Just wondering if you ever got an answer to your question about monitoring
  progress - either offline or through your own investigation. Any findings
  would be appreciated.

  Thanks,
  Philip

  On 01/30/2014 10:32 PM, DB Tsai wrote:

   Hi guys,

   When we're running a very long job, we would like to show users the
   current progress of map and reduce job. After looking at the api document,
   I don't find anything for this. However, in Spark UI, I could see the
   progress of the task. Is there anything I miss?

   Thanks.

   Sincerely,

   DB Tsai
   Machine Learning Engineer
   Alpine Data Labs
   --
   Web: http://alpinenow.com/


Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Kanwaldeep
I've removed the dependency on akka in a separate project but am still running
into the same error. In the POM dependency hierarchy I do see both 2.4.1 - shaded
and 2.5.0 being included. If there were a conflict with a project dependency, I
would think I should be getting the same error in my local setup as well.

Here are the dependencies I'm using:

<dependencies>
  <dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-core</artifactId>
    <version>1.0.13</version>
  </dependency>
  <dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-classic</artifactId>
    <version>1.0.13</version>
  </dependency>

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>0.9.0-incubating</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>0.9.0-incubating</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>0.9.0-incubating</version>
  </dependency>

  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase</artifactId>
    <version>0.94.15-cdh4.6.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.0.0-cdh4.6.0</version>
  </dependency>
  <dependency>
    <groupId>com.google.protobuf</groupId>
    <artifactId>protobuf-java</artifactId>
    <version>2.5.0</version>
  </dependency>

  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>1.7.5</version>
  </dependency>

  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.2</version>
  </dependency>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-actors</artifactId>
    <version>2.10.2</version>
  </dependency>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-reflect</artifactId>
    <version>2.10.2</version>
  </dependency>

  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>1.7.5</version>
  </dependency>
</dependencies>
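
A hedged aside (not from the thread): a common way to make sure only protobuf 2.5.0
ends up on the application classpath is to exclude protobuf-java from transitive
dependencies that pull in an older version (the CDH4 hadoop-client and hbase artifacts
are likely candidates here). In Maven that is an exclusions block on those dependencies;
for an sbt build, the rough equivalent would look like this sketch:

  // build.sbt sketch; versions copied from the POM above, exclusions are the point
  // (CDH artifacts also need the Cloudera repository resolver)
  libraryDependencies ++= Seq(
    "org.apache.hadoop" % "hadoop-client" % "2.0.0-cdh4.6.0"
      exclude("com.google.protobuf", "protobuf-java"),
    "org.apache.hbase" % "hbase" % "0.94.15-cdh4.6.0"
      exclude("com.google.protobuf", "protobuf-java"),
    "com.google.protobuf" % "protobuf-java" % "2.5.0"
  )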






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396p3585.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly 

That's all I do. 

On Apr 1, 2014, at 11:41 AM, Patrick Wendell pwend...@gmail.com wrote:

 Vidal - could you show exactly what flags/commands you are using when you 
 build spark to produce this assembly?
 
 
 On Tue, Apr 1, 2014 at 12:53 AM, Vipul Pandey vipan...@gmail.com wrote:
 Spark now shades its own protobuf dependency so protobuf 2.4.1 should't be 
 getting pulled in unless you are directly using akka yourself. Are you?
 
 No i'm not. Although I see that protobuf libraries are directly pulled into 
 the 0.9.0 assembly jar - I do see the shaded version as well. 
 e.g. below for Message.class
 
 -bash-4.1$ jar -ftv 
 ./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.0-cdh4.2.1.jar
  | grep protobuf | grep /Message.class
478 Thu Jun 30 15:26:12 PDT 2011 com/google/protobuf/Message.class
508 Sat Dec 14 14:20:38 PST 2013 com/google/protobuf_spark/Message.class
 
 
 Does your project have other dependencies that might be indirectly pulling 
 in protobuf 2.4.1? It would be helpful if you could list all of your 
 dependencies including the exact Spark version and other libraries.
 
 I did have another one which I moved to the end of classpath - even ran 
 partial code without that dependency but it still failed whenever I use the 
 jar with ScalaBuf dependency. 
 Spark version is 0.9.0
 
 
 ~Vipul
 
 On Mar 31, 2014, at 4:51 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 Spark now shades its own protobuf dependency so protobuf 2.4.1 should't be 
 getting pulled in unless you are directly using akka yourself. Are you?
 
 Does your project have other dependencies that might be indirectly pulling 
 in protobuf 2.4.1? It would be helpful if you could list all of your 
 dependencies including the exact Spark version and other libraries.
 
 - Patrick
 
 
 On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey vipan...@gmail.com wrote:
 I'm using ScalaBuff (which depends on protobuf2.5) and facing the same 
 issue. any word on this one?
 On Mar 27, 2014, at 6:41 PM, Kanwaldeep kanwal...@gmail.com wrote:
 
  We are using Protocol Buffer 2.5 to send messages to Spark Streaming 0.9 
  with
  Kafka stream setup. I have protocol Buffer 2.5 part of the uber jar 
  deployed
  on each of the spark worker nodes.
  The message is compiled using 2.5 but then on runtime it is being
  de-serialized by 2.4.1 as I'm getting the following exception
 
  java.lang.VerifyError (java.lang.VerifyError: class
  com.snc.sinet.messages.XServerMessage$XServer overrides final method
  getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;)
  java.lang.ClassLoader.defineClass1(Native Method)
  java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
  java.lang.ClassLoader.defineClass(ClassLoader.java:615)
  java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
 
  Suggestions on how I could still use ProtoBuf 2.5. Based on the article -
  https://spark-project.atlassian.net/browse/SPARK-995 we should be able to
  use different version of protobuf in the application.
 
 
 
 
 
  --
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 
 
 



Re: Best practices: Parallelized write to / read from S3

2014-04-01 Thread Nicholas Chammas
Alright, so I've upped the minSplits parameter on my call to textFile, but
the resulting RDD still has only 1 partition, which I assume means it was
read in on a single process. I am checking the number of partitions in
pyspark by using the rdd._jrdd.splits().size() trick I picked up on this
list.

The source file is a gzipped text file. I have heard things about gzipped
files not being splittable.

Is this the reason that opening the file with minSplits = 10 still gives me
an RDD with one partition? If so, I guess the only way to speed up the load
would be to change the source file's format to something splittable.

Otherwise, if I want to speed up subsequent computation on the RDD, I
should explicitly partition it with a call to RDD.partitionBy(10).

Is that correct?


On Mon, Mar 31, 2014 at 1:15 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 OK sweet. Thanks for walking me through that.

 I wish this were StackOverflow so I could bestow some nice rep on all you
 helpful people.


 On Mon, Mar 31, 2014 at 1:06 PM, Aaron Davidson ilike...@gmail.com wrote:

 Note that you may have minSplits set to more than the number of cores in
 the cluster, and Spark will just run as many as possible at a time. This is
 better if certain nodes may be slow, for instance.

 In general, it is not necessarily the case that doubling the number of
 cores doing IO will double the throughput, because you could be saturating
 the throughput with fewer cores. However, S3 is odd in that each connection
 gets way less bandwidth than your network link can provide, and it does
 seem to scale linearly with the number of connections. So, yes, taking
 minSplits up to 4 (or higher) will likely result in a 2x performance
 improvement.

 saveAsTextFile() will use as many partitions (aka splits) as the RDD it's
 being called on. So for instance:

  sc.textFile("myInputFile", 15).map(lambda x: x +
  "!!!").saveAsTextFile("myOutputFile")

 will use 15 partitions to read the text file (i.e., up to 15 cores at a
 time) and then again to save back to S3.



 On Mon, Mar 31, 2014 at 9:46 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  So setting minSplits
  (http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.context.SparkContext-class.html#textFile)
  will
 set the parallelism on the read in SparkContext.textFile(), assuming I have
 the cores in the cluster to deliver that level of parallelism. And if I
 don't explicitly provide it, Spark will set the minSplits to 2.

 So for example, say I have a cluster with 4 cores total, and it takes 40
 minutes to read a single file from S3 with minSplits at 2. It should take
 roughly 20 minutes to read the same file if I up minSplits to 4.

 Did I understand that correctly?

 RDD.saveAsTextFile() doesn't have an analog to minSplits, so I'm
 guessing that's not an operation the user can tune.


 On Mon, Mar 31, 2014 at 12:29 PM, Aaron Davidson ilike...@gmail.comwrote:

 Spark will only use each core for one task at a time, so doing

  sc.textFile("s3 location", numReducers)

  where you set numReducers to at least as many as the total number of
  cores in your cluster, is about as fast as you can get out of the box. Same
  goes for saveAsTextFile.


 On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Howdy-doody,

 I have a single, very large file sitting in S3 that I want to read in
 with sc.textFile(). What are the best practices for reading in this file 
 as
 quickly as possible? How do I parallelize the read as much as possible?

 Similarly, say I have a single, very large RDD sitting in memory that
 I want to write out to S3 with RDD.saveAsTextFile(). What are the best
 practices for writing this file out as quickly as possible?

 Nick


 --
 View this message in context: Best practices: Parallelized write to / read from S3
 (http://apache-spark-user-list.1001560.n3.nabble.com/Best-practices-Parallelized-write-to-read-from-S3-tp3516.html)
 Sent from the Apache Spark User List mailing list archive
 (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com.








Generic types and pair RDDs

2014-04-01 Thread Daniel Siegmann
When my tuple type includes a generic type parameter, the pair RDD
functions aren't available. Take for example the following (a join on two
RDDs, taking the sum of the values):

def joinTest(rddA: RDD[(String, Int)], rddB: RDD[(String, Int)]) :
RDD[(String, Int)] = {
rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
}

That works fine, but lets say I replace the type of the key with a generic
type:

def joinTest[K](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) : RDD[(K, Int)] =
{
rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
}

This latter function gets the compiler error "value join is not a member of
org.apache.spark.rdd.RDD[(K, Int)]".

The reason is probably obvious, but I don't have much Scala experience. Can
anyone explain what I'm doing wrong?

-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: Best practices: Parallelized write to / read from S3

2014-04-01 Thread Aaron Davidson
Looks like you're right that gzip files are not easily splittable [1], and
also about everything else you said.

[1]
http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCANDWdjY2hN-=jXTSNZ8JHZ=G-S+ZKLNze=rgkjacjaw3tto...@mail.gmail.com%3E




On Tue, Apr 1, 2014 at 1:51 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Alright, so I've upped the minSplits parameter on my call to textFile, but
 the resulting RDD still has only 1 partition, which I assume means it was
 read in on a single process. I am checking the number of partitions in
 pyspark by using the rdd._jrdd.splits().size() trick I picked up on this
 list.

 The source file is a gzipped text file. I have heard things about gzipped
 files not being splittable.

 Is this the reason that opening the file with minSplits = 10 still gives
 me an RDD with one partition? If so, I guess the only way to speed up the
 load would be to change the source file's format to something splittable.

 Otherwise, if I want to speed up subsequent computation on the RDD, I
 should explicitly partition it with a call to RDD.partitionBy(10).

 Is that correct?


 On Mon, Mar 31, 2014 at 1:15 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 OK sweet. Thanks for walking me through that.

 I wish this were StackOverflow so I could bestow some nice rep on all you
 helpful people.


 On Mon, Mar 31, 2014 at 1:06 PM, Aaron Davidson ilike...@gmail.comwrote:

 Note that you may have minSplits set to more than the number of cores in
 the cluster, and Spark will just run as many as possible at a time. This is
 better if certain nodes may be slow, for instance.

 In general, it is not necessarily the case that doubling the number of
 cores doing IO will double the throughput, because you could be saturating
 the throughput with fewer cores. However, S3 is odd in that each connection
 gets way less bandwidth than your network link can provide, and it does
 seem to scale linearly with the number of connections. So, yes, taking
 minSplits up to 4 (or higher) will likely result in a 2x performance
 improvement.

 saveAsTextFile() will use as many partitions (aka splits) as the RDD
 it's being called on. So for instance:

 sc.textFile(myInputFile, 15).map(lambda x: x +
 !!!).saveAsTextFile(myOutputFile)

 will use 15 partitions to read the text file (i.e., up to 15 cores at a
 time) and then again to save back to S3.



 On Mon, Mar 31, 2014 at 9:46 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 So setting 
 minSplitshttp://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.context.SparkContext-class.html#textFile
  will
 set the parallelism on the read in SparkContext.textFile(), assuming I have
 the cores in the cluster to deliver that level of parallelism. And if I
 don't explicitly provide it, Spark will set the minSplits to 2.

 So for example, say I have a cluster with 4 cores total, and it takes
 40 minutes to read a single file from S3 with minSplits at 2. Tt should
 take roughly 20 minutes to read the same file if I up minSplits to 4.

 Did I understand that correctly?

 RDD.saveAsTextFile() doesn't have an analog to minSplits, so I'm
 guessing that's not an operation the user can tune.


 On Mon, Mar 31, 2014 at 12:29 PM, Aaron Davidson ilike...@gmail.comwrote:

 Spark will only use each core for one task at a time, so doing

 sc.textFile(s3 location, num reducers)

 where you set num reducers to at least as many as the total number
 of cores in your cluster, is about as fast you can get out of the box. 
 Same
 goes for saveAsTextFile.


 On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Howdy-doody,

 I have a single, very large file sitting in S3 that I want to read in
 with sc.textFile(). What are the best practices for reading in this file 
 as
 quickly as possible? How do I parallelize the read as much as possible?

 Similarly, say I have a single, very large RDD sitting in memory that
 I want to write out to S3 with RDD.saveAsTextFile(). What are the best
 practices for writing this file out as quickly as possible?

 Nick


 --
 View this message in context: Best practices: Parallelized write to
 / read from 
 S3http://apache-spark-user-list.1001560.n3.nabble.com/Best-practices-Parallelized-write-to-read-from-S3-tp3516.html
 Sent from the Apache Spark User List mailing list 
 archivehttp://apache-spark-user-list.1001560.n3.nabble.com/at 
 Nabble.com.









Re: Generic types and pair RDDs

2014-04-01 Thread Koert Kuipers
  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag

  def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) :
RDD[(K, Int)] = {
    rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
  }


On Tue, Apr 1, 2014 at 4:55 PM, Daniel Siegmann daniel.siegm...@velos.io wrote:

 When my tuple type includes a generic type parameter, the pair RDD
 functions aren't available. Take for example the following (a join on two
 RDDs, taking the sum of the values):

 def joinTest(rddA: RDD[(String, Int)], rddB: RDD[(String, Int)]) :
 RDD[(String, Int)] = {
 rddA.join(rddB).map { case (k, (a, b)) = (k, a+b) }
 }

 That works fine, but lets say I replace the type of the key with a generic
 type:

 def joinTest[K](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) : RDD[(K, Int)]
 = {
 rddA.join(rddB).map { case (k, (a, b)) = (k, a+b) }
 }

 This latter function gets the compiler error value join is not a member
 of org.apache.spark.rdd.RDD[(K, Int)].

 The reason is probably obvious, but I don't have much Scala experience.
 Can anyone explain what I'm doing wrong?

 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
 E: daniel.siegm...@velos.io W: www.velos.io



Re: Generic types and pair RDDs

2014-04-01 Thread Aaron Davidson
Koert's answer is very likely correct. The implicit definition that
converts an RDD[(K, V)] to provide PairRDDFunctions requires that a ClassTag be
available for K:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1124

To fully understand what's going on from a Scala beginner's point of view,
you'll have to look up ClassTags, context bounds (the K : ClassTag
syntax), and implicit functions. Fortunately, you don't have to understand
monads...
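
For what it's worth, a desugared sketch of Koert's fix (same semantics, just spelling
out the implicit parameter that the "K: ClassTag" context bound generates; it assumes
the same imports as his snippet, including org.apache.spark.SparkContext._ for join):

  def joinTest[K](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)])
                 (implicit kt: ClassTag[K]): RDD[(K, Int)] =
    rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }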


On Tue, Apr 1, 2014 at 2:06 PM, Koert Kuipers ko...@tresata.com wrote:

   import org.apache.spark.SparkContext._
   import org.apache.spark.rdd.RDD
   import scala.reflect.ClassTag

   def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) :
 RDD[(K, Int)] = {

 rddA.join(rddB).map { case (k, (a, b)) = (k, a+b) }
   }


 On Tue, Apr 1, 2014 at 4:55 PM, Daniel Siegmann 
 daniel.siegm...@velos.iowrote:

 When my tuple type includes a generic type parameter, the pair RDD
 functions aren't available. Take for example the following (a join on two
 RDDs, taking the sum of the values):

 def joinTest(rddA: RDD[(String, Int)], rddB: RDD[(String, Int)]) :
 RDD[(String, Int)] = {
 rddA.join(rddB).map { case (k, (a, b)) = (k, a+b) }
 }

 That works fine, but lets say I replace the type of the key with a
 generic type:

 def joinTest[K](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) : RDD[(K, Int)]
 = {
 rddA.join(rddB).map { case (k, (a, b)) = (k, a+b) }
 }

 This latter function gets the compiler error value join is not a member
 of org.apache.spark.rdd.RDD[(K, Int)].

 The reason is probably obvious, but I don't have much Scala experience.
 Can anyone explain what I'm doing wrong?

 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
 E: daniel.siegm...@velos.io W: www.velos.io





PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Nicholas Chammas
Just an FYI, it's not obvious from the docs
(http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy)
that the following code should fail:

a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
a._jrdd.splits().size()
a.count()
b = a.partitionBy(5)
b._jrdd.splits().size()
b.count()

I figured out from the example that if I generated a key by doing this

b = a.map(lambda x: (x, x)).partitionBy(5)

then all would be well.

In other words, partitionBy() only works on RDDs of tuples. Is that correct?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-RDD-partitionBy-requires-an-RDD-of-tuples-tp3598.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Cannot Access Web UI

2014-04-01 Thread yxzhao
http://spark.incubator.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging
As the above shows:

Monitoring and Logging
Spark’s standalone mode offers a web-based user interface to monitor the
cluster. The master and each worker has its own web UI that shows cluster
and job statistics. By default you can access the web UI for the master at
port 8080. The port can be changed either in the configuration file or via
command-line options.


But I cannot open the web UI. The master IP is 10.1.255.202, so I enter
http://10.1.255.202:8080 in my web browser, but the webpage is not
available.
Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-Access-Web-UI-tp3599.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Cannot Access Web UI

2014-04-01 Thread Nicholas Chammas
Are you trying to access the UI from another machine? If so, first confirm
that you don't have a network issue by opening the UI from the master node
itself.

For example:

yum -y install lynx
lynx ip_address:8080

If this succeeds, then you likely have something blocking you from
accessing the web page from another machine.

Nick


On Tue, Apr 1, 2014 at 6:30 PM, yxzhao yxz...@ualr.edu wrote:


http://spark.incubator.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging
 As the above shows:
 
 Monitoring and Logging
 Spark's standalone mode offers a web-based user interface to monitor the
 cluster. The master and each worker has its own web UI that shows cluster
 and job statistics. By default you can access the web UI for the master at
 port 8080. The port can be changed either in the configuration file or via
 command-line options.

 
 But I cannot open the web ui. The master ip is 10.1.255.202 so I input:
 http://10.1.255.202 :8080 in my web browser. Bui the webpage is not
 available.
 Thanks.



 --
 View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-Access-Web-UI-tp3599.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
Do you get the same problem if you build with maven?


On Tue, Apr 1, 2014 at 12:23 PM, Vipul Pandey vipan...@gmail.com wrote:

 SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly

 That's all I do.

 On Apr 1, 2014, at 11:41 AM, Patrick Wendell pwend...@gmail.com wrote:

 Vidal - could you show exactly what flags/commands you are using when you
 build spark to produce this assembly?


 On Tue, Apr 1, 2014 at 12:53 AM, Vipul Pandey vipan...@gmail.com wrote:

 Spark now shades its own protobuf dependency so protobuf 2.4.1 should't
 be getting pulled in unless you are directly using akka yourself. Are you?

 No i'm not. Although I see that protobuf libraries are directly pulled
 into the 0.9.0 assembly jar - I do see the shaded version as well.
 e.g. below for Message.class

 -bash-4.1$ jar -ftv
 ./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.0-cdh4.2.1.jar
 | grep protobuf | grep /Message.class
478 Thu Jun 30 15:26:12 PDT 2011 com/google/protobuf/Message.class
508 Sat Dec 14 14:20:38 PST 2013
 com/google/protobuf_spark/Message.class


 Does your project have other dependencies that might be indirectly
 pulling in protobuf 2.4.1? It would be helpful if you could list all of
 your dependencies including the exact Spark version and other libraries.

 I did have another one which I moved to the end of classpath - even ran
 partial code without that dependency but it still failed whenever I use the
 jar with ScalaBuf dependency.
 Spark version is 0.9.0


 ~Vipul

 On Mar 31, 2014, at 4:51 PM, Patrick Wendell pwend...@gmail.com wrote:

 Spark now shades its own protobuf dependency so protobuf 2.4.1 should't
 be getting pulled in unless you are directly using akka yourself. Are you?

 Does your project have other dependencies that might be indirectly
 pulling in protobuf 2.4.1? It would be helpful if you could list all of
 your dependencies including the exact Spark version and other libraries.

 - Patrick


 On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey vipan...@gmail.comwrote:

 I'm using ScalaBuff (which depends on protobuf2.5) and facing the same
 issue. any word on this one?
 On Mar 27, 2014, at 6:41 PM, Kanwaldeep kanwal...@gmail.com wrote:

  We are using Protocol Buffer 2.5 to send messages to Spark Streaming
 0.9 with
  Kafka stream setup. I have protocol Buffer 2.5 part of the uber jar
 deployed
  on each of the spark worker nodes.
  The message is compiled using 2.5 but then on runtime it is being
  de-serialized by 2.4.1 as I'm getting the following exception
 
  java.lang.VerifyError (java.lang.VerifyError: class
  com.snc.sinet.messages.XServerMessage$XServer overrides final method
  getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;)
  java.lang.ClassLoader.defineClass1(Native Method)
  java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
  java.lang.ClassLoader.defineClass(ClassLoader.java:615)
  java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
 
  Suggestions on how I could still use ProtoBuf 2.5. Based on the
 article -
  https://spark-project.atlassian.net/browse/SPARK-995 we should be
 able to
  use different version of protobuf in the application.
 
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396.html
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com http://nabble.com/.








Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Aaron Davidson
Hm, yeah, the docs are not clear on this one. The function you're looking
for to change the number of partitions on any ol' RDD is repartition(),
which is available in master but for some reason doesn't seem to show up in
the latest docs. Sorry about that, I also didn't realize partitionBy() had
this behavior from reading the Python docs (though it is consistent with
the Scala API, just more type-safe there).


On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Just an FYI, it's not obvious from the 
 docshttp://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBythat
  the following code should fail:

 a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
 a._jrdd.splits().size()
 a.count()
 b = a.partitionBy(5)
 b._jrdd.splits().size()
 b.count()

 I figured out from the example that if I generated a key by doing this

 b = a.map(lambda x: (x, x)).partitionBy(5)

  then all would be well.

 In other words, partitionBy() only works on RDDs of tuples. Is that
 correct?

 Nick


 --
 View this message in context: PySpark RDD.partitionBy() requires an RDD
 of 
 tupleshttp://apache-spark-user-list.1001560.n3.nabble.com/PySpark-RDD-partitionBy-requires-an-RDD-of-tuples-tp3598.html
 Sent from the Apache Spark User List mailing list 
 archivehttp://apache-spark-user-list.1001560.n3.nabble.com/at Nabble.com.



Re: Best practices: Parallelized write to / read from S3

2014-04-01 Thread Nicholas Chammas
Alright!

Thanks for that link. I did a little research based on it, and it looks like
Snappy or LZO + some container would be better alternatives to gzip.

I confirmed that gzip was cramping my style by trying sc.textFile() on an
uncompressed version of the text file. With the uncompressed file, setting
minSplits gives me an RDD that is partitioned as expected. This makes all
my subsequent operations, obviously, much faster.

I dunno if it would be appropriate to have Spark issue some kind of warning
that "Hey, your file is compressed using gzip so..."

Anyway, mystery solved!

Nick


On Tue, Apr 1, 2014 at 5:03 PM, Aaron Davidson ilike...@gmail.com wrote:

 Looks like you're right that gzip files are not easily splittable [1], and
 also about everything else you said.

 [1]
 http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCANDWdjY2hN-=jXTSNZ8JHZ=G-S+ZKLNze=rgkjacjaw3tto...@mail.gmail.com%3E




 On Tue, Apr 1, 2014 at 1:51 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Alright, so I've upped the minSplits parameter on my call to textFile,
 but the resulting RDD still has only 1 partition, which I assume means it
 was read in on a single process. I am checking the number of partitions in
 pyspark by using the rdd._jrdd.splits().size() trick I picked up on this
 list.

 The source file is a gzipped text file. I have heard things about gzipped
 files not being splittable.

 Is this the reason that opening the file with minSplits = 10 still gives
 me an RDD with one partition? If so, I guess the only way to speed up the
 load would be to change the source file's format to something splittable.

 Otherwise, if I want to speed up subsequent computation on the RDD, I
 should explicitly partition it with a call to RDD.partitionBy(10).

 Is that correct?


 On Mon, Mar 31, 2014 at 1:15 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 OK sweet. Thanks for walking me through that.

 I wish this were StackOverflow so I could bestow some nice rep on all
 you helpful people.


 On Mon, Mar 31, 2014 at 1:06 PM, Aaron Davidson ilike...@gmail.comwrote:

 Note that you may have minSplits set to more than the number of cores
 in the cluster, and Spark will just run as many as possible at a time. This
 is better if certain nodes may be slow, for instance.

 In general, it is not necessarily the case that doubling the number of
 cores doing IO will double the throughput, because you could be saturating
 the throughput with fewer cores. However, S3 is odd in that each connection
 gets way less bandwidth than your network link can provide, and it does
 seem to scale linearly with the number of connections. So, yes, taking
 minSplits up to 4 (or higher) will likely result in a 2x performance
 improvement.

 saveAsTextFile() will use as many partitions (aka splits) as the RDD
 it's being called on. So for instance:

 sc.textFile(myInputFile, 15).map(lambda x: x +
 !!!).saveAsTextFile(myOutputFile)

 will use 15 partitions to read the text file (i.e., up to 15 cores at a
 time) and then again to save back to S3.



 On Mon, Mar 31, 2014 at 9:46 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 So setting 
 minSplitshttp://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.context.SparkContext-class.html#textFile
  will
 set the parallelism on the read in SparkContext.textFile(), assuming I 
 have
 the cores in the cluster to deliver that level of parallelism. And if I
 don't explicitly provide it, Spark will set the minSplits to 2.

 So for example, say I have a cluster with 4 cores total, and it takes
 40 minutes to read a single file from S3 with minSplits at 2. Tt should
 take roughly 20 minutes to read the same file if I up minSplits to 4.

 Did I understand that correctly?

 RDD.saveAsTextFile() doesn't have an analog to minSplits, so I'm
 guessing that's not an operation the user can tune.


 On Mon, Mar 31, 2014 at 12:29 PM, Aaron Davidson 
 ilike...@gmail.comwrote:

 Spark will only use each core for one task at a time, so doing

 sc.textFile(s3 location, num reducers)

 where you set num reducers to at least as many as the total number
 of cores in your cluster, is about as fast you can get out of the box. 
 Same
 goes for saveAsTextFile.


 On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Howdy-doody,

 I have a single, very large file sitting in S3 that I want to read
 in with sc.textFile(). What are the best practices for reading in this 
 file
 as quickly as possible? How do I parallelize the read as much as 
 possible?

 Similarly, say I have a single, very large RDD sitting in memory
 that I want to write out to S3 with RDD.saveAsTextFile(). What are the 
 best
 practices for writing this file out as quickly as possible?

 Nick


 --
 View this message in context: Best practices: Parallelized write to
 / read from 
 

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Nicholas Chammas
Hmm, doing help(rdd) in PySpark doesn't show a method called repartition().
Trying rdd.repartition() or rdd.repartition(10) also fails. I'm on 0.9.0.

The approach I'm going with to partition my MappedRDD is to key it by a
random int, and then partition it.

So something like:

rdd = sc.textFile('s3n://gzipped_file_brah.gz') # rdd has 1 partition;
minSplits is not actionable due to gzip
keyed_rdd = rdd.keyBy(lambda x: randint(1,100)) # we key the RDD so we can
partition it
partitioned_rdd = keyed_rdd.partitionBy(10) # rdd has 10 partitions

Are you saying I don't have to do this?

Nick



On Tue, Apr 1, 2014 at 7:38 PM, Aaron Davidson ilike...@gmail.com wrote:

 Hm, yeah, the docs are not clear on this one. The function you're looking
 for to change the number of partitions on any ol' RDD is repartition(),
 which is available in master but for some reason doesn't seem to show up in
 the latest docs. Sorry about that, I also didn't realize partitionBy() had
 this behavior from reading the Python docs (though it is consistent with
 the Scala API, just more type-safe there).


 On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Just an FYI, it's not obvious from the 
 docshttp://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBythat
  the following code should fail:

 a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
 a._jrdd.splits().size()
 a.count()
 b = a.partitionBy(5)
 b._jrdd.splits().size()
 b.count()

 I figured out from the example that if I generated a key by doing this

 b = a.map(lambda x: (x, x)).partitionBy(5)

  then all would be well.

 In other words, partitionBy() only works on RDDs of tuples. Is that
 correct?

 Nick


 --
 View this message in context: PySpark RDD.partitionBy() requires an RDD
 of 
 tupleshttp://apache-spark-user-list.1001560.n3.nabble.com/PySpark-RDD-partitionBy-requires-an-RDD-of-tuples-tp3598.html
 Sent from the Apache Spark User List mailing list 
 archivehttp://apache-spark-user-list.1001560.n3.nabble.com/at Nabble.com.





Issue with zip and partitions

2014-04-01 Thread Patrick_Nicolas
I got an exception "Can't zip RDDs with unequal numbers of partitions" when I 
apply any action (reduce, collect) to a dataset created by zipping two datasets of 
10 million entries each. The problem occurs independently of the number of 
partitions, and also when I let Spark create the partitions itself.

Interestingly enough, I do not have the problem when zipping datasets of 1 and 2.5 
million entries.
A similar problem was reported on this board with 0.8, but I don't remember if the 
problem was fixed.

Any idea? Any workaround?

I'd appreciate it.
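
For context, a small sketch (not from the original message) of the contract zip()
enforces, plus one workaround under the assumption that the pairing is positional;
an existing SparkContext `sc` is assumed:

  import org.apache.spark.SparkContext._   // pair-RDD functions (0.9-era import)

  // zip() requires the two RDDs to have the same number of partitions AND the
  // same number of elements in every partition.
  val a = sc.parallelize(1 to 1000000, 8)
  val b = sc.parallelize(1 to 1000000, 8).map(_ * 2)
  a.zip(b).count()                        // fine: partition structure matches exactly

  // Anything that changes per-partition counts (a filter, a differently built
  // source) breaks that contract, and an action on the zipped RDD fails at runtime.
  // A workaround is to key both sides explicitly and join instead of zipping:
  val keyedA = a.map(x => (x, x))
  val keyedB = b.map(x => (x / 2, x))
  val joined = keyedA.join(keyedB)        // RDD[(Int, (Int, Int))]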


Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Mayur Rustagi
You can get detailed information about each stage through the Spark listener
interface. Multiple jobs may be compressed into a single stage, so job-wise
information would be the same as Spark's.
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
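
A rough sketch of hooking into that listener interface (not from the original message;
the event class names here follow later Spark releases and may differ slightly in 0.9,
so treat it as an outline rather than the exact API of that era):

  import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

  // Counts finished tasks and stages so a driver-side thread can report coarse progress.
  class ProgressListener extends SparkListener {
    @volatile var tasksEnded = 0L
    @volatile var stagesCompleted = 0L

    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
      tasksEnded += 1

    override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
      stagesCompleted += 1
  }

  // Usage: register before running the job, then poll the counters, e.g.
  // sc.addSparkListener(new ProgressListener)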



On Tue, Apr 1, 2014 at 11:18 AM, Kevin Markey kevin.mar...@oracle.com wrote:

  The discussion there hits on the distinction of jobs and stages.  When
 looking at one application, there are hundreds of stages, sometimes
 thousands.  Depends on the data and the task.  And the UI seems to track
 stages.  And one could independently track them for such a job.  But what
 if -- as occurs in another application -- there's only one or two stages,
 but lots of data passing through those 1 or 2 stages?

 Kevin Markey



 On 04/01/2014 09:55 AM, Mark Hamstra wrote:

 Some related discussion: https://github.com/apache/spark/pull/246


 On Tue, Apr 1, 2014 at 8:43 AM, Philip Ogren philip.og...@oracle.comwrote:

 Hi DB,

 Just wondering if you ever got an answer to your question about
 monitoring progress - either offline or through your own investigation.
  Any findings would be appreciated.

 Thanks,
 Philip


 On 01/30/2014 10:32 PM, DB Tsai wrote:

 Hi guys,

 When we're running a very long job, we would like to show users the
 current progress of map and reduce job. After looking at the api document,
 I don't find anything for this. However, in Spark UI, I could see the
 progress of the task. Is there anything I miss?

 Thanks.

 Sincerely,

 DB Tsai
 Machine Learning Engineer
 Alpine Data Labs
 --
 Web: http://alpinenow.com/







Re: Status of MLI?

2014-04-01 Thread Nan Zhu
MLlib has been part of the Spark distribution (under the mllib directory); also check 
http://spark.apache.org/docs/latest/mllib-guide.html 

As for JIRA, because of the recent migration to Apache JIRA, I think all 
MLlib-related issues should be under the Spark umbrella: 
https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel
 

-- 
Nan Zhu


On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:

 What is the current development status of MLI/MLBase? I see that the github 
 repo is lying dormant (https://github.com/amplab/MLI) and JIRA has had no 
 activity in the last 30 days 
 (https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
  Is the plan to add a lot of this into mllib itself without needing a 
 separate API?
 
 Thanks!
 
 View this message in context: Status of MLI? 
 (http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610.html)
 Sent from the Apache Spark User List mailing list archive 
 (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com 
 (http://Nabble.com).



Re: Status of MLI?

2014-04-01 Thread Krakna H
Hi Nan,

I was actually referring to MLI/MLBase (http://www.mlbase.org); is this
being actively developed?

I'm familiar with mllib and have been looking at its documentation.

Thanks!


On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] 
ml-node+s1001560n3611...@n3.nabble.com wrote:

  mllib has been part of Spark distribution (under mllib directory), also
 check http://spark.apache.org/docs/latest/mllib-guide.html

 and for JIRA, because of the recent migration to apache JIRA, I think all
 mllib-related issues should be under the Spark umbrella,
 https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

 --
 Nan Zhu

 On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:

 What is the current development status of MLI/MLBase? I see that the
 github repo is lying dormant (https://github.com/amplab/MLI) and JIRA has
 had no activity in the last 30 days (
 https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
 Is the plan to add a lot of this into mllib itself without needing a
 separate API?

 Thanks!

 --
 View this message in context: Status of 
 MLI?http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610.html
 Sent from the Apache Spark User List mailing list 
 archivehttp://apache-spark-user-list.1001560.n3.nabble.com/at
 Nabble.com.









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3612.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Status of MLI?

2014-04-01 Thread Nan Zhu
Ah, I see, I’m sorry, I didn’t read your email carefully   

then I have no idea about the progress on MLBase

Best,  

--  
Nan Zhu


On Tuesday, April 1, 2014 at 11:05 PM, Krakna H wrote:

 Hi Nan,
  
 I was actually referring to MLI/MLBase (http://www.mlbase.org); is this being 
 actively developed?
  
 I'm familiar with mllib and have been looking at its documentation.
  
 Thanks!
  
  
 On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] 
 [hidden email] (/user/SendEmail.jtp?type=nodenode=3612i=0) wrote:
  mllib has been part of Spark distribution (under mllib directory), also 
  check http://spark.apache.org/docs/latest/mllib-guide.html  
   
  and for JIRA, because of the recent migration to apache JIRA, I think all 
  mllib-related issues should be under the Spark umbrella, 
  https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

   
  --  
  Nan Zhu
   
   
  On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:
   
   
   What is the current development status of MLI/MLBase? I see that the 
   github repo is lying dormant (https://github.com/amplab/MLI) and JIRA has 
   had no activity in the last 30 days 
   (https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
Is the plan to add a lot of this into mllib itself without needing a 
   separate API?

   Thanks!

   View this message in context: Status of MLI? 
   (http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610.html)
   Sent from the Apache Spark User List mailing list archive 
   (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com 
   (http://Nabble.com).
   
   
   

  
 View this message in context: Re: Status of MLI? 
 (http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3612.html)
 Sent from the Apache Spark User List mailing list archive 
 (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com 
 (http://Nabble.com).



Re: Status of MLI?

2014-04-01 Thread Evan R. Sparks
Hi there,

MLlib is the first component of MLbase - MLI and the higher levels of the
stack are still being developed. Look for updates in terms of our progress
on the hyperparameter tuning/model selection problem in the next month or
so!

- Evan


On Tue, Apr 1, 2014 at 8:05 PM, Krakna H shankark+...@gmail.com wrote:

 Hi Nan,

 I was actually referring to MLI/MLBase (http://www.mlbase.org); is this
 being actively developed?

 I'm familiar with mllib and have been looking at its documentation.

 Thanks!


 On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] [hidden
 email] http://user/SendEmail.jtp?type=nodenode=3612i=0 wrote:

  mllib has been part of Spark distribution (under mllib directory), also
 check http://spark.apache.org/docs/latest/mllib-guide.html

 and for JIRA, because of the recent migration to apache JIRA, I think all
 mllib-related issues should be under the Spark umbrella,
 https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

 --
 Nan Zhu

 On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:

 What is the current development status of MLI/MLBase? I see that the
 github repo is lying dormant (https://github.com/amplab/MLI) and JIRA
 has had no activity in the last 30 days (
 https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
 Is the plan to add a lot of this into mllib itself without needing a
 separate API?

 Thanks!

 --
 View this message in context: Status of 
 MLI?http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610.html
 Sent from the Apache Spark User List mailing list 
 archivehttp://apache-spark-user-list.1001560.n3.nabble.com/at
 Nabble.com.







 --
 View this message in context: Re: Status of MLI?
 (http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3612.html)
 Sent from the Apache Spark User List mailing list archive
 (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com.