Re: Configuring distributed caching with Spark and YARN

2014-04-01 Thread santhoma
I think with addJar() there is no 'caching', in the sense that files will be
copied every time, per job.
Whereas with the Hadoop distributed cache, files will be copied only once, and a
symlink will be created to the cached file for subsequent runs:
https://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/filecache/DistributedCache.html

Also, the Hadoop distributed cache can copy an archive file to the node and
unzip it automatically into the current working dir. The advantage here is that
the copying will be very fast.

Still looking for a similar mechanism in Spark.
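
A rough Spark-side analog, as far as I know, is SparkContext.addFile() plus
SparkFiles.get(): the file is fetched to each node once per application (not
once per task), although it is not unpacked automatically the way an archive in
the Hadoop distributed cache is. A minimal Scala sketch, with hypothetical HDFS
paths:

    import org.apache.spark.{SparkContext, SparkFiles}

    // sc is an existing SparkContext; the paths below are made up for illustration.
    sc.addFile("hdfs:///shared/lookup.txt")

    val hits = sc.textFile("hdfs:///input/events").mapPartitions { lines =>
      // Each node fetches the file once; tasks then read the local copy.
      val localPath = SparkFiles.get("lookup.txt")
      val lookup = scala.io.Source.fromFile(localPath).getLines().toSet
      lines.filter(lookup.contains)
    }
    hits.count()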




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-distributed-caching-with-Spark-and-YARN-tp1074p3566.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
> Spark now shades its own protobuf dependency so protobuf 2.4.1 shouldn't be 
> getting pulled in unless you are directly using akka yourself. Are you?

No, I'm not. Although I see that protobuf libraries are directly pulled into the 
0.9.0 assembly jar - I do see the shaded version as well. 
e.g. below for Message.class

-bash-4.1$ jar -ftv 
./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.0-cdh4.2.1.jar
 | grep protobuf | grep /Message.class
   478 Thu Jun 30 15:26:12 PDT 2011 com/google/protobuf/Message.class
   508 Sat Dec 14 14:20:38 PST 2013 com/google/protobuf_spark/Message.class


> Does your project have other dependencies that might be indirectly pulling in 
> protobuf 2.4.1? It would be helpful if you could list all of your 
> dependencies including the exact Spark version and other libraries.

I did have another one, which I moved to the end of the classpath - I even ran
partial code without that dependency, but it still failed whenever I used the
jar with the ScalaBuff dependency.
The Spark version is 0.9.0.


~Vipul

On Mar 31, 2014, at 4:51 PM, Patrick Wendell  wrote:

> Spark now shades its own protobuf dependency so protobuf 2.4.1 shouldn't be 
> getting pulled in unless you are directly using akka yourself. Are you?
> 
> Does your project have other dependencies that might be indirectly pulling in 
> protobuf 2.4.1? It would be helpful if you could list all of your 
> dependencies including the exact Spark version and other libraries.
> 
> - Patrick
> 
> 
> On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey  wrote:
> I'm using ScalaBuff (which depends on protobuf 2.5) and facing the same issue.
> Any word on this one?
> On Mar 27, 2014, at 6:41 PM, Kanwaldeep  wrote:
> 
> > We are using Protocol Buffer 2.5 to send messages to Spark Streaming 0.9 
> > with
> > Kafka stream setup. I have protocol Buffer 2.5 part of the uber jar deployed
> > on each of the spark worker nodes.
> > The message is compiled using 2.5 but then on runtime it is being
> > de-serialized by 2.4.1 as I'm getting the following exception
> >
> > java.lang.VerifyError (java.lang.VerifyError: class
> > com.snc.sinet.messages.XServerMessage$XServer overrides final method
> > getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;)
> > java.lang.ClassLoader.defineClass1(Native Method)
> > java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
> > java.lang.ClassLoader.defineClass(ClassLoader.java:615)
> > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
> >
> > Suggestions on how I could still use ProtoBuf 2.5. Based on the article -
> > https://spark-project.atlassian.net/browse/SPARK-995 we should be able to
> > use different version of protobuf in the application.
> >
> >
> >
> >
> >
> > --
> > View this message in context: 
> > http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> 



Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
btw, this is where it fails 




14/04/01 00:59:32 INFO storage.MemoryStore: ensureFreeSpace(84106) called with 
curMem=0, maxMem=4939225497
14/04/01 00:59:32 INFO storage.MemoryStore: Block broadcast_0 stored as values 
to memory (estimated size 82.1 KB, free 4.6 GB)
java.lang.UnsupportedOperationException: This is supposed to be overridden by 
subclasses.
at 
com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto.getSerializedSize(ClientNamenodeProtocolProtos.java:30042)
at 
com.google.protobuf.AbstractMessageLite.toByteString(AbstractMessageLite.java:49)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.constructRpcRequest(ProtobufRpcEngine.java:149)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:193)
at $Proxy14.getFileInfo(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
at $Proxy14.getFileInfo(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:628)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1545)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:805)
at 
org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1670)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1616)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:174)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:205)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at 
org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at 
org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
at 
org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:354)



On Apr 1, 2014, at 12:53 AM, Vipul Pandey  wrote:

>> Spark now shades its own protobuf dependency so protobuf 2.4.1 should't be 
>> getting pulled in unless you are directly using akka yourself. Are you?
> 
> No i'm not. Although I see that protobuf libraries are directly pulled into 
> the 0.9.0 assembly jar - I do see the shaded version as well. 
> e.g. below for Message.class
> 
> -bash-4.1$ jar -ftv 
> ./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.0-cdh4.2.1.jar
>  | grep protobuf | grep /Message.class
>478 Thu Jun 30 15:26:12 PDT 2011 com/google/protobuf/Message.class
>508 Sat Dec 14 14:20:38 PST 2013 com/google/protobuf_spark/Message.class
> 
> 
>> Does your project have other dependencies that might be indirectly pulling 
>> in protobuf 2.4.1? It would be helpful if you could list all of your 
>> dependencies including the exact Spark version and other libraries.

SSH problem

2014-04-01 Thread Sai Prasanna
Hi All,

I have a five node spark cluster, Master, s1,s2,s3,s4.

I have passwordless ssh to all slaves from master and vice-versa.
But on one machine, s2, what happens is that after 2-3 minutes of my
connection from master to slave, the write pipe is broken. So if I try to
connect again from the master I get the error
"ssh: Connect to host s1 port 22: Connection refused".
But I can log in from s2 to the master, and after doing that, within the next
2-3 minutes I get access from master to s2 again.

What is this strange behaviour?
I have set complete access in hosts.allow as well!


Sliding Subwindows

2014-04-01 Thread aecc
Hello, I would like to have a kind of sub-window. The idea is to have 3
windows in the following way:

future   <--w1--> <------w2------> <--w3-->   past

So I can do some processing on the new data coming (w1) into the main
window (w2), and some processing on the data leaving the window (w3).

Any ideas on how I can do this in Spark? Is there a way to create
sub-windows, or to specify when a window should start reading?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Sliding-Subwindows-tp3572.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


foreach not working

2014-04-01 Thread eric perler
Hello,
I am on my second day with Spark, and I'm having trouble getting the foreach
function to work with the network wordcount example. I can see that the
"flatMap" and "map" methods are being invoked, but I don't seem to be getting
into the foreach method. Not sure if what I am doing even makes sense. Any
help is appreciated. Thx!!!
JavaDStream<String> lines = ssc.socketTextStream("localhost", 1234);

JavaDStream<String> words = lines
    .flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String x) {
            // -- this is being invoked
            return Lists.newArrayList(x.split(" "));
        }
    });

JavaPairDStream<String, Integer> wordCounts = words
    .map(new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) throws Exception {
            // -- this is being invoked
            return new Tuple2<String, Integer>(s, 1);
        }
    });

wordCounts.foreach(collectTuplesFunc);
ssc.start();
ssc.awaitTermination();
}

Function<JavaPairRDD<String, Integer>, Void> collectTuplesFunc =
    new Function<JavaPairRDD<String, Integer>, Void>() {
        @Override
        public Void call(JavaPairRDD<String, Integer> arg0) throws Exception {
            // -- this is NOT being invoked
            return null;
        }
    };
I am assuming that the foreach call is where I would write to an external
system. Please correct me if this assumption is wrong.
Thanks again

Re: Unable to submit an application to standalone cluster which on hdfs.

2014-04-01 Thread haikal.pribadi
How do you remove the validation blocker from the compilation?

Thank you



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-submit-an-application-to-standalone-cluster-which-on-hdfs-tp1730p3574.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


custom receiver in java

2014-04-01 Thread eric perler
I would like to write a custom receiver to receive data from a Tibco RV subject.
I found this Scala example:
http://spark.incubator.apache.org/docs/0.8.0/streaming-custom-receivers.html
but I can't seem to find a Java example.
Does anybody know of a good Java example for creating a custom receiver?
Thx

Use combineByKey and StatCount

2014-04-01 Thread Jaonary Rabarisoa
Hi all,

Can someone give me some tips on computing the mean of an RDD by key, maybe
with combineByKey and StatCounter?

Cheers,

Jaonary
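
A minimal sketch of one way to do it, assuming the values are Doubles keyed by
Strings and using org.apache.spark.util.StatCounter as the combiner (meanByKey
is a hypothetical helper name):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.util.StatCounter

    // Mean of the values for each key, built with combineByKey.
    def meanByKey(rdd: RDD[(String, Double)]): RDD[(String, Double)] = {
      rdd.combineByKey(
        (v: Double) => new StatCounter().merge(v),        // createCombiner
        (acc: StatCounter, v: Double) => acc.merge(v),    // mergeValue
        (a: StatCounter, b: StatCounter) => a.merge(b)    // mergeCombiners
      ).mapValues(_.mean)
    }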


Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Philip Ogren

Hi DB,

Just wondering if you ever got an answer to your question about 
monitoring progress - either offline or through your own investigation.  
Any findings would be appreciated.


Thanks,
Philip

On 01/30/2014 10:32 PM, DB Tsai wrote:

Hi guys,

When we're running a very long job, we would like to show users the 
current progress of map and reduce job. After looking at the api 
document, I don't find anything for this. However, in Spark UI, I 
could see the progress of the task. Is there anything I miss?


Thanks.

Sincerely,

DB Tsai
Machine Learning Engineer
Alpine Data Labs
--
Web: http://alpinenow.com/




Mllib in pyspark for 0.8.1

2014-04-01 Thread Ian Ferreira

Hi there,

For some reason the distribution and build for 0.8.1 do not include
the MLlib libraries for PySpark, i.e. importing from mllib fails.

Seems to be addressed in 0.9.0, but that has other issues running on
Mesos in standalone mode :)

Any pointers?

Cheers
- Ian



Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Mark Hamstra
Some related discussion: https://github.com/apache/spark/pull/246


On Tue, Apr 1, 2014 at 8:43 AM, Philip Ogren wrote:

> Hi DB,
>
> Just wondering if you ever got an answer to your question about monitoring
> progress - either offline or through your own investigation.  Any findings
> would be appreciated.
>
> Thanks,
> Philip
>
>
> On 01/30/2014 10:32 PM, DB Tsai wrote:
>
>> Hi guys,
>>
>> When we're running a very long job, we would like to show users the
>> current progress of map and reduce job. After looking at the api document,
>> I don't find anything for this. However, in Spark UI, I could see the
>> progress of the task. Is there anything I miss?
>>
>> Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> Machine Learning Engineer
>> Alpine Data Labs
>> --
>> Web: http://alpinenow.com/
>>
>
>


Re: Mllib in pyspark for 0.8.1

2014-04-01 Thread Matei Zaharia
You could probably port it back, but it required some changes on the Java side 
as well (a new PythonMLUtils class). It might be easier to fix the Mesos issues 
with 0.9.

Matei

On Apr 1, 2014, at 8:53 AM, Ian Ferreira  wrote:

> 
> Hi there,
> 
> For some reason the distribution and build for 0.8.1 does not include
> the MLLib libraries for pyspark i.e. import from mllib fails.
> 
> Seems to be addressed in 0.9.0, but that has other issue running on
> mesos in standalone mode :)
> 
> Any pointers?
> 
> Cheers
> - Ian
> 



Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Kanwaldeep
Yes I'm using akka as well. But if that is the problem then I should have
been facing this issue in my local setup as well. I'm only running into this
error when using the Spark standalone cluster.

But will try out your suggestion and let you know.

Thanks
Kanwal



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396p3582.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Kevin Markey

The discussion there hits on the distinction of jobs and stages. 
When looking at one application, there are hundreds of stages,
sometimes thousands.  Depends on the data and the task.  And the UI
seems to track stages.  And one could independently track them for
such a job.  But what if -- as occurs in another application --
there's only one or two stages, but lots of data passing through
those 1 or 2 stages?

Kevin Markey


On 04/01/2014 09:55 AM, Mark Hamstra wrote:

> Some related discussion: https://github.com/apache/spark/pull/246
>
> On Tue, Apr 1, 2014 at 8:43 AM, Philip Ogren wrote:
>
>> Hi DB,
>>
>> Just wondering if you ever got an answer to your question
>> about monitoring progress - either offline or through your
>> own investigation.  Any findings would be appreciated.
>>
>> Thanks,
>> Philip
>>
>> On 01/30/2014 10:32 PM, DB Tsai wrote:
>>
>>> Hi guys,
>>>
>>> When we're running a very long job, we would like to
>>> show users the current progress of map and reduce job.
>>> After looking at the api document, I don't find
>>> anything for this. However, in Spark UI, I could see
>>> the progress of the task. Is there anything I miss?
>>>
>>> Thanks.
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> Machine Learning Engineer
>>> Alpine Data Labs
>>> --
>>> Web: http://alpinenow.com/


Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Kanwaldeep
I've removed the dependency on akka in a separate project but am still running
into the same error. In the POM dependency hierarchy I do see both 2.4.1
(shaded) and 2.5.0 being included. If there were a conflict with a project
dependency, I would think I should be getting the same error in my local setup
as well.

Here are the dependencies I'm using:



<dependencies>
  <dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-core</artifactId>
    <version>1.0.13</version>
  </dependency>
  <dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-classic</artifactId>
    <version>1.0.13</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>0.9.0-incubating</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>0.9.0-incubating</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>0.9.0-incubating</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase</artifactId>
    <version>0.94.15-cdh4.6.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.0.0-cdh4.6.0</version>
  </dependency>
  <dependency>
    <groupId>com.google.protobuf</groupId>
    <artifactId>protobuf-java</artifactId>
    <version>2.5.0</version>
  </dependency>
  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>1.7.5</version>
  </dependency>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.2</version>
  </dependency>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-actors</artifactId>
    <version>2.10.2</version>
  </dependency>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-reflect</artifactId>
    <version>2.10.2</version>
  </dependency>
  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>1.7.5</version>
  </dependency>
</dependencies>







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396p3585.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
Vipul - could you show exactly what flags/commands you are using when you
build spark to produce this assembly?


On Tue, Apr 1, 2014 at 12:53 AM, Vipul Pandey  wrote:

> Spark now shades its own protobuf dependency so protobuf 2.4.1 should't be
> getting pulled in unless you are directly using akka yourself. Are you?
>
> No i'm not. Although I see that protobuf libraries are directly pulled
> into the 0.9.0 assembly jar - I do see the shaded version as well.
> e.g. below for Message.class
>
> -bash-4.1$ jar -ftv
> ./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.0-cdh4.2.1.jar
> | grep protobuf | grep /Message.class
>478 Thu Jun 30 15:26:12 PDT 2011 com/google/protobuf/Message.class
>508 Sat Dec 14 14:20:38 PST 2013 com/google/protobuf_spark/Message.class
>
>
> Does your project have other dependencies that might be indirectly pulling
> in protobuf 2.4.1? It would be helpful if you could list all of your
> dependencies including the exact Spark version and other libraries.
>
> I did have another one which I moved to the end of classpath - even ran
> partial code without that dependency but it still failed whenever I use the
> jar with ScalaBuf dependency.
> Spark version is 0.9.0
>
>
> ~Vipul
>
> On Mar 31, 2014, at 4:51 PM, Patrick Wendell  wrote:
>
> Spark now shades its own protobuf dependency so protobuf 2.4.1 should't be
> getting pulled in unless you are directly using akka yourself. Are you?
>
> Does your project have other dependencies that might be indirectly pulling
> in protobuf 2.4.1? It would be helpful if you could list all of your
> dependencies including the exact Spark version and other libraries.
>
> - Patrick
>
>
> On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey  wrote:
>
>> I'm using ScalaBuff (which depends on protobuf2.5) and facing the same
>> issue. any word on this one?
>> On Mar 27, 2014, at 6:41 PM, Kanwaldeep  wrote:
>>
>> > We are using Protocol Buffer 2.5 to send messages to Spark Streaming
>> 0.9 with
>> > Kafka stream setup. I have protocol Buffer 2.5 part of the uber jar
>> deployed
>> > on each of the spark worker nodes.
>> > The message is compiled using 2.5 but then on runtime it is being
>> > de-serialized by 2.4.1 as I'm getting the following exception
>> >
>> > java.lang.VerifyError (java.lang.VerifyError: class
>> > com.snc.sinet.messages.XServerMessage$XServer overrides final method
>> > getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;)
>> > java.lang.ClassLoader.defineClass1(Native Method)
>> > java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
>> > java.lang.ClassLoader.defineClass(ClassLoader.java:615)
>> > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
>> >
>> > Suggestions on how I could still use ProtoBuf 2.5. Based on the article
>> -
>> > https://spark-project.atlassian.net/browse/SPARK-995 we should be able
>> to
>> > use different version of protobuf in the application.
>> >
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com
>> .
>>
>>
>
>


Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly 

That's all I do. 

On Apr 1, 2014, at 11:41 AM, Patrick Wendell  wrote:

> Vidal - could you show exactly what flags/commands you are using when you 
> build spark to produce this assembly?
> 
> 
> On Tue, Apr 1, 2014 at 12:53 AM, Vipul Pandey  wrote:
>> Spark now shades its own protobuf dependency so protobuf 2.4.1 should't be 
>> getting pulled in unless you are directly using akka yourself. Are you?
> 
> No i'm not. Although I see that protobuf libraries are directly pulled into 
> the 0.9.0 assembly jar - I do see the shaded version as well. 
> e.g. below for Message.class
> 
> -bash-4.1$ jar -ftv 
> ./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.0-cdh4.2.1.jar
>  | grep protobuf | grep /Message.class
>478 Thu Jun 30 15:26:12 PDT 2011 com/google/protobuf/Message.class
>508 Sat Dec 14 14:20:38 PST 2013 com/google/protobuf_spark/Message.class
> 
> 
>> Does your project have other dependencies that might be indirectly pulling 
>> in protobuf 2.4.1? It would be helpful if you could list all of your 
>> dependencies including the exact Spark version and other libraries.
> 
> I did have another one which I moved to the end of classpath - even ran 
> partial code without that dependency but it still failed whenever I use the 
> jar with ScalaBuf dependency. 
> Spark version is 0.9.0
> 
> 
> ~Vipul
> 
> On Mar 31, 2014, at 4:51 PM, Patrick Wendell  wrote:
> 
>> Spark now shades its own protobuf dependency so protobuf 2.4.1 should't be 
>> getting pulled in unless you are directly using akka yourself. Are you?
>> 
>> Does your project have other dependencies that might be indirectly pulling 
>> in protobuf 2.4.1? It would be helpful if you could list all of your 
>> dependencies including the exact Spark version and other libraries.
>> 
>> - Patrick
>> 
>> 
>> On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey  wrote:
>> I'm using ScalaBuff (which depends on protobuf2.5) and facing the same 
>> issue. any word on this one?
>> On Mar 27, 2014, at 6:41 PM, Kanwaldeep  wrote:
>> 
>> > We are using Protocol Buffer 2.5 to send messages to Spark Streaming 0.9 
>> > with
>> > Kafka stream setup. I have protocol Buffer 2.5 part of the uber jar 
>> > deployed
>> > on each of the spark worker nodes.
>> > The message is compiled using 2.5 but then on runtime it is being
>> > de-serialized by 2.4.1 as I'm getting the following exception
>> >
>> > java.lang.VerifyError (java.lang.VerifyError: class
>> > com.snc.sinet.messages.XServerMessage$XServer overrides final method
>> > getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;)
>> > java.lang.ClassLoader.defineClass1(Native Method)
>> > java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
>> > java.lang.ClassLoader.defineClass(ClassLoader.java:615)
>> > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
>> >
>> > Suggestions on how I could still use ProtoBuf 2.5. Based on the article -
>> > https://spark-project.atlassian.net/browse/SPARK-995 we should be able to
>> > use different version of protobuf in the application.
>> >
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context: 
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> 
> 
> 



Re: possible bug in Spark's ALS implementation...

2014-04-01 Thread Nick Pentreath
Hi Michael

Would you mind setting out exactly what differences you did find between
the Spark and Oryx implementations? Would be good to be clear on them, and
also see if there are further tricks/enhancements from the Oryx one that
can be ported (such as the lambda * numRatings adjustment).
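
For reference, that adjustment is, as I understand it, the weighted-lambda
regularization from Zhou et al.'s ALS-WR paper, where each factor's penalty is
scaled by its number of ratings; a sketch of the objective in LaTeX:

    \min_{X,Y} \sum_{(u,i) \in R} \left( r_{ui} - x_u^\top y_i \right)^2
      + \lambda \left( \sum_u n_u \lVert x_u \rVert^2 + \sum_i m_i \lVert y_i \rVert^2 \right)

where n_u is the number of ratings by user u and m_i the number of ratings of
item i.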

N


On Sat, Mar 15, 2014 at 2:52 AM, Michael Allman  wrote:

> I've been thoroughly investigating this issue over the past couple of days
> and have discovered quite a bit. For one thing, there is definitely (at
> least) one issue/bug in the Spark implementation that leads to incorrect
> results for models generated with rank > 1 or a large number of iterations.
> I will post a bug report with a thorough explanation this weekend or on
> Monday.
>
> I believe I've been able to track down every difference between the Spark
> and Oryx implementations that lead to different results. I made some
> adjustments to the spark implementation so that, given the same initial
> product/item vectors, the resulting model is identical to the one produced
> by Oryx within a small numerical tolerance. I've verified this for small
> data sets and am working on verifying this with some large data sets.
>
> Aside from those already identified in this thread, another significant
> difference in the Spark implementation is that it begins the factorization
> process by computing the product matrix (Y) from the initial user matrix
> (X). Both of the papers on ALS referred to in this thread begin the process
> by computing the user matrix. I haven't done any testing comparing the
> models generated starting from Y or X, but they are very different. Is
> there
> a reason Spark begins the iteration by computing Y?
>
> Initializing both X and Y as is done in the Spark implementation seems
> unnecessary unless I'm overlooking some desired side-effect. Only the
> factor
> matrix which generates the other in the first iteration needs to be
> initialized.
>
> I also found that the product and user RDDs were being rebuilt many times
> over in my tests, even for tiny data sets. By persisting the RDD returned
> from updateFeatures() I was able to avoid a raft of duplicate computations.
> Is there a reason not to do this?
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tp2567p2704.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: custom receiver in java

2014-04-01 Thread Tathagata Das
Unfortunately, there isn't a good Java-friendly way to define custom
receivers. However, I am currently refactoring the receiver interface to
make it more Java-friendly, and I hope to get that into the Spark 1.0 release.

In the meantime, I would encourage you to define the custom receiver in Scala.
If you are not comfortable with Scala, we can write all the core
functionality in Java as functions, and create a skeleton Scala custom
receiver that calls into the Java functions (shown below). Unfortunately,
you have to use the Scala compiler to compile this.

TD

---
  class CustomReceiver[T](parameters: CustomReceiverParameters)
    extends NetworkReceiver[T] {

    protected lazy val blocksGenerator: BlockGenerator =
      new BlockGenerator(StorageLevel.MEMORY_ONLY_SER_2)

    // create an instance of the Java receiver object
    val javaReceiver = new MyJavaReceiver(parameters, blocksGenerator)

    protected def onStart() = {
      blocksGenerator.start()
      javaReceiver.start()   // start the Java receiver
    }

    protected def onStop() {
      blocksGenerator.stop()
      javaReceiver.stop()
    }
  }




On Tue, Apr 1, 2014 at 7:54 AM, eric perler  wrote:

> i would like to write a custom receiver to receive data from a Tibco RV
> subject
>
> i found this scala example..
>
>
> http://spark.incubator.apache.org/docs/0.8.0/streaming-custom-receivers.html
>
> but i cant seem to find a java example
>
> does anybody know of a good java example for creating a custom receiver
>
> thx
>


Protobuf 2.5 Mesos

2014-04-01 Thread Ian Ferreira
From what I can tell, I need to use Mesos 0.17 to support protobuf 2.5,
which is required for Hadoop 2.3.0. However, I still run into the JVM
error, which appears to be related to protobuf compatibility. Any
recommendations?



Re: Best practices: Parallelized write to / read from S3

2014-04-01 Thread Nicholas Chammas
Alright, so I've upped the minSplits parameter on my call to textFile, but
the resulting RDD still has only 1 partition, which I assume means it was
read in on a single process. I am checking the number of partitions in
pyspark by using the rdd._jrdd.splits().size() trick I picked up on this
list.

The source file is a gzipped text file. I have heard things about gzipped
files not being splittable.

Is this the reason that opening the file with minSplits = 10 still gives me
an RDD with one partition? If so, I guess the only way to speed up the load
would be to change the source file's format to something splittable.

Otherwise, if I want to speed up subsequent computation on the RDD, I
should explicitly partition it with a call to RDD.partitionBy(10).

Is that correct?


On Mon, Mar 31, 2014 at 1:15 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> OK sweet. Thanks for walking me through that.
>
> I wish this were StackOverflow so I could bestow some nice rep on all you
> helpful people.
>
>
> On Mon, Mar 31, 2014 at 1:06 PM, Aaron Davidson wrote:
>
>> Note that you may have minSplits set to more than the number of cores in
>> the cluster, and Spark will just run as many as possible at a time. This is
>> better if certain nodes may be slow, for instance.
>>
>> In general, it is not necessarily the case that doubling the number of
>> cores doing IO will double the throughput, because you could be saturating
>> the throughput with fewer cores. However, S3 is odd in that each connection
>> gets way less bandwidth than your network link can provide, and it does
>> seem to scale linearly with the number of connections. So, yes, taking
>> minSplits up to 4 (or higher) will likely result in a 2x performance
>> improvement.
>>
>> saveAsTextFile() will use as many partitions (aka splits) as the RDD it's
>> being called on. So for instance:
>>
>> sc.textFile(myInputFile, 15).map(lambda x: x +
>> "!!!").saveAsTextFile(myOutputFile)
>>
>> will use 15 partitions to read the text file (i.e., up to 15 cores at a
>> time) and then again to save back to S3.
>>
>>
>>
>> On Mon, Mar 31, 2014 at 9:46 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> So setting minSplits will set the parallelism on the read in
>>> SparkContext.textFile(), assuming I have the cores in the cluster to
>>> deliver that level of parallelism. And if I don't explicitly provide it,
>>> Spark will set the minSplits to 2.
>>>
>>> So for example, say I have a cluster with 4 cores total, and it takes 40
>>> minutes to read a single file from S3 with minSplits at 2. It should take
>>> roughly 20 minutes to read the same file if I up minSplits to 4.
>>>
>>> Did I understand that correctly?
>>>
>>> RDD.saveAsTextFile() doesn't have an analog to minSplits, so I'm
>>> guessing that's not an operation the user can tune.
>>>
>>>
>>> On Mon, Mar 31, 2014 at 12:29 PM, Aaron Davidson wrote:
>>>
 Spark will only use each core for one task at a time, so doing

 sc.textFile(<file>, <num reducers>)

 where you set "num reducers" to at least as many as the total number of
 cores in your cluster, is about as fast you can get out of the box. Same
 goes for saveAsTextFile.


 On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> Howdy-doody,
>
> I have a single, very large file sitting in S3 that I want to read in
> with sc.textFile(). What are the best practices for reading in this file 
> as
> quickly as possible? How do I parallelize the read as much as possible?
>
> Similarly, say I have a single, very large RDD sitting in memory that
> I want to write out to S3 with RDD.saveAsTextFile(). What are the best
> practices for writing this file out as quickly as possible?
>
> Nick
>
>
> --
> View this message in context: Best practices: Parallelized write to /
> read from 
> S3
> Sent from the Apache Spark User List mailing list 
> archiveat 
> Nabble.com.
>


>>>
>>
>


Generic types and pair RDDs

2014-04-01 Thread Daniel Siegmann
When my tuple type includes a generic type parameter, the pair RDD
functions aren't available. Take for example the following (a join on two
RDDs, taking the sum of the values):

def joinTest(rddA: RDD[(String, Int)], rddB: RDD[(String, Int)]) :
RDD[(String, Int)] = {
rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
}

That works fine, but let's say I replace the type of the key with a generic
type:

def joinTest[K](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) : RDD[(K, Int)] =
{
rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
}

This latter function gets the compiler error "value join is not a member of
org.apache.spark.rdd.RDD[(K, Int)]".

The reason is probably obvious, but I don't have much Scala experience. Can
anyone explain what I'm doing wrong?

-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: Best practices: Parallelized write to / read from S3

2014-04-01 Thread Aaron Davidson
Looks like you're right that gzip files are not easily splittable [1], and
also about everything else you said.

[1]
http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCANDWdjY2hN-=jXTSNZ8JHZ=G-S+ZKLNze=rgkjacjaw3tto...@mail.gmail.com%3E




On Tue, Apr 1, 2014 at 1:51 PM, Nicholas Chammas  wrote:

> Alright, so I've upped the minSplits parameter on my call to textFile, but
> the resulting RDD still has only 1 partition, which I assume means it was
> read in on a single process. I am checking the number of partitions in
> pyspark by using the rdd._jrdd.splits().size() trick I picked up on this
> list.
>
> The source file is a gzipped text file. I have heard things about gzipped
> files not being splittable.
>
> Is this the reason that opening the file with minSplits = 10 still gives
> me an RDD with one partition? If so, I guess the only way to speed up the
> load would be to change the source file's format to something splittable.
>
> Otherwise, if I want to speed up subsequent computation on the RDD, I
> should explicitly partition it with a call to RDD.partitionBy(10).
>
> Is that correct?
>
>
> On Mon, Mar 31, 2014 at 1:15 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> OK sweet. Thanks for walking me through that.
>>
>> I wish this were StackOverflow so I could bestow some nice rep on all you
>> helpful people.
>>
>>
>> On Mon, Mar 31, 2014 at 1:06 PM, Aaron Davidson wrote:
>>
>>> Note that you may have minSplits set to more than the number of cores in
>>> the cluster, and Spark will just run as many as possible at a time. This is
>>> better if certain nodes may be slow, for instance.
>>>
>>> In general, it is not necessarily the case that doubling the number of
>>> cores doing IO will double the throughput, because you could be saturating
>>> the throughput with fewer cores. However, S3 is odd in that each connection
>>> gets way less bandwidth than your network link can provide, and it does
>>> seem to scale linearly with the number of connections. So, yes, taking
>>> minSplits up to 4 (or higher) will likely result in a 2x performance
>>> improvement.
>>>
>>> saveAsTextFile() will use as many partitions (aka splits) as the RDD
>>> it's being called on. So for instance:
>>>
>>> sc.textFile(myInputFile, 15).map(lambda x: x +
>>> "!!!").saveAsTextFile(myOutputFile)
>>>
>>> will use 15 partitions to read the text file (i.e., up to 15 cores at a
>>> time) and then again to save back to S3.
>>>
>>>
>>>
>>> On Mon, Mar 31, 2014 at 9:46 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 So setting 
 minSplits
  will
 set the parallelism on the read in SparkContext.textFile(), assuming I have
 the cores in the cluster to deliver that level of parallelism. And if I
 don't explicitly provide it, Spark will set the minSplits to 2.

 So for example, say I have a cluster with 4 cores total, and it takes
 40 minutes to read a single file from S3 with minSplits at 2. Tt should
 take roughly 20 minutes to read the same file if I up minSplits to 4.

 Did I understand that correctly?

 RDD.saveAsTextFile() doesn't have an analog to minSplits, so I'm
 guessing that's not an operation the user can tune.


 On Mon, Mar 31, 2014 at 12:29 PM, Aaron Davidson wrote:

> Spark will only use each core for one task at a time, so doing
>
> sc.textFile(, )
>
> where you set "num reducers" to at least as many as the total number
> of cores in your cluster, is about as fast you can get out of the box. 
> Same
> goes for saveAsTextFile.
>
>
> On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Howdy-doody,
>>
>> I have a single, very large file sitting in S3 that I want to read in
>> with sc.textFile(). What are the best practices for reading in this file 
>> as
>> quickly as possible? How do I parallelize the read as much as possible?
>>
>> Similarly, say I have a single, very large RDD sitting in memory that
>> I want to write out to S3 with RDD.saveAsTextFile(). What are the best
>> practices for writing this file out as quickly as possible?
>>
>> Nick
>>
>>
>> --
>> View this message in context: Best practices: Parallelized write to
>> / read from 
>> S3
>> Sent from the Apache Spark User List mailing list 
>> archiveat 
>> Nabble.com.
>>
>
>

>>>
>>
>


Re: Generic types and pair RDDs

2014-04-01 Thread Koert Kuipers
  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag

  def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) :
RDD[(K, Int)] = {
rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
  }


On Tue, Apr 1, 2014 at 4:55 PM, Daniel Siegmann wrote:

> When my tuple type includes a generic type parameter, the pair RDD
> functions aren't available. Take for example the following (a join on two
> RDDs, taking the sum of the values):
>
> def joinTest(rddA: RDD[(String, Int)], rddB: RDD[(String, Int)]) :
> RDD[(String, Int)] = {
> rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
> }
>
> That works fine, but lets say I replace the type of the key with a generic
> type:
>
> def joinTest[K](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) : RDD[(K, Int)]
> = {
> rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
> }
>
> This latter function gets the compiler error "value join is not a member
> of org.apache.spark.rdd.RDD[(K, Int)]".
>
> The reason is probably obvious, but I don't have much Scala experience.
> Can anyone explain what I'm doing wrong?
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegm...@velos.io W: www.velos.io
>


Re: Generic types and pair RDDs

2014-04-01 Thread Aaron Davidson
Koert's answer is very likely correct. This implicit definition, which
converts an RDD[(K, V)] to provide PairRDDFunctions, requires that a ClassTag
be available for K:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1124

To fully understand what's going on from a Scala beginner's point of view,
you'll have to look up ClassTags, context bounds (the "K : ClassTag"
syntax), and implicit functions. Fortunately, you don't have to understand
monads...


On Tue, Apr 1, 2014 at 2:06 PM, Koert Kuipers  wrote:

>   import org.apache.spark.SparkContext._
>   import org.apache.spark.rdd.RDD
>   import scala.reflect.ClassTag
>
>   def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) :
> RDD[(K, Int)] = {
>
> rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
>   }
>
>
> On Tue, Apr 1, 2014 at 4:55 PM, Daniel Siegmann 
> wrote:
>
>> When my tuple type includes a generic type parameter, the pair RDD
>> functions aren't available. Take for example the following (a join on two
>> RDDs, taking the sum of the values):
>>
>> def joinTest(rddA: RDD[(String, Int)], rddB: RDD[(String, Int)]) :
>> RDD[(String, Int)] = {
>> rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
>> }
>>
>> That works fine, but lets say I replace the type of the key with a
>> generic type:
>>
>> def joinTest[K](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) : RDD[(K, Int)]
>> = {
>> rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
>> }
>>
>> This latter function gets the compiler error "value join is not a member
>> of org.apache.spark.rdd.RDD[(K, Int)]".
>>
>> The reason is probably obvious, but I don't have much Scala experience.
>> Can anyone explain what I'm doing wrong?
>>
>> --
>> Daniel Siegmann, Software Developer
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>> E: daniel.siegm...@velos.io W: www.velos.io
>>
>
>


Re: Generic types and pair RDDs

2014-04-01 Thread Daniel Siegmann
That worked, thank you both! Thanks also Aaron for the list of things I
need to read up on - I hadn't heard of ClassTag before.


On Tue, Apr 1, 2014 at 5:10 PM, Aaron Davidson  wrote:

> Koert's answer is very likely correct. This implicit definition which
> converts an RDD[(K, V)] to provide PairRDDFunctions requires a ClassTag is
> available for K:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1124
>
> To fully understand what's going on from a Scala beginner's point of view,
> you'll have to look up ClassTags, context bounds (the "K : ClassTag"
> syntax), and implicit functions. Fortunately, you don't have to understand
> monads...
>
>
> On Tue, Apr 1, 2014 at 2:06 PM, Koert Kuipers  wrote:
>
>>   import org.apache.spark.SparkContext._
>>   import org.apache.spark.rdd.RDD
>>   import scala.reflect.ClassTag
>>
>>   def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) :
>> RDD[(K, Int)] = {
>>
>> rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
>>   }
>>
>>
>> On Tue, Apr 1, 2014 at 4:55 PM, Daniel Siegmann > > wrote:
>>
>>> When my tuple type includes a generic type parameter, the pair RDD
>>> functions aren't available. Take for example the following (a join on two
>>> RDDs, taking the sum of the values):
>>>
>>> def joinTest(rddA: RDD[(String, Int)], rddB: RDD[(String, Int)]) :
>>> RDD[(String, Int)] = {
>>> rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
>>> }
>>>
>>> That works fine, but lets say I replace the type of the key with a
>>> generic type:
>>>
>>> def joinTest[K](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) : RDD[(K,
>>> Int)] = {
>>> rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
>>> }
>>>
>>> This latter function gets the compiler error "value join is not a member
>>> of org.apache.spark.rdd.RDD[(K, Int)]".
>>>
>>> The reason is probably obvious, but I don't have much Scala experience.
>>> Can anyone explain what I'm doing wrong?
>>>
>>> --
>>> Daniel Siegmann, Software Developer
>>> Velos
>>> Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>>> E: daniel.siegm...@velos.io W: www.velos.io
>>>
>>
>>
>


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Nicholas Chammas
Just an FYI, it's not obvious from the docs that the following code should
fail:

a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
a._jrdd.splits().size()
a.count()
b = a.partitionBy(5)
b._jrdd.splits().size()
b.count()

I figured out from the example that if I generated a key by doing this

b = a.map(lambda x: (x, x)).partitionBy(5)

then all would be well.

In other words, partitionBy() only works on RDDs of tuples. Is that correct?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-RDD-partitionBy-requires-an-RDD-of-tuples-tp3598.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Cannot Access Web UI

2014-04-01 Thread yxzhao
http://spark.incubator.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging
As the above shows:
"
Monitoring and Logging
Spark’s standalone mode offers a web-based user interface to monitor the
cluster. The master and each worker has its own web UI that shows cluster
and job statistics. By default you can access the web UI for the master at
port 8080. The port can be changed either in the configuration file or via
command-line options.

"
But I cannot open the web UI. The master IP is 10.1.255.202, so I input
http://10.1.255.202:8080 in my web browser. But the webpage is not
available.
Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-Access-Web-UI-tp3599.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Cannot Access Web UI

2014-04-01 Thread Nicholas Chammas
Are you trying to access the UI from another machine? If so, first confirm
that you don't have a network issue by opening the UI from the master node
itself.

For example:

yum -y install lynx
lynx ip_address:8080

If this succeeds, then you likely have something blocking you from
accessing the web page from another machine.

Nick


On Tue, Apr 1, 2014 at 6:30 PM, yxzhao  wrote:
>
>
http://spark.incubator.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging
> As the above shows:
> "
> Monitoring and Logging
> Spark's standalone mode offers a web-based user interface to monitor the
> cluster. The master and each worker has its own web UI that shows cluster
> and job statistics. By default you can access the web UI for the master at
> port 8080. The port can be changed either in the configuration file or via
> command-line options.
>
> "
> But I cannot open the web ui. The master ip is 10.1.255.202 so I input:
> http://10.1.255.202 :8080 in my web browser. Bui the webpage is not
> available.
> Thanks.
>
>
>
> --
> View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-Access-Web-UI-tp3599.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Cannot Access Web UI

2014-04-01 Thread Nicholas Chammas
Make that

lynx localhost:8080

to isolate any network-related problems.


On Tue, Apr 1, 2014 at 6:58 PM, Nicholas Chammas  wrote:

> Are you trying to access the UI from another machine? If so, first confirm
> that you don't have a network issue by opening the UI from the master node
> itself.
>
> For example:
>
> yum -y install lynx
> lynx ip_address:8080
>
> If this succeeds, then you likely have something blocking you from
> accessing the web page from another machine.
>
> Nick
>
>
>
> On Tue, Apr 1, 2014 at 6:30 PM, yxzhao  wrote:
> >
> >
> http://spark.incubator.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging
> > As the above shows:
> > "
> > Monitoring and Logging
> > Spark's standalone mode offers a web-based user interface to monitor the
> > cluster. The master and each worker has its own web UI that shows cluster
> > and job statistics. By default you can access the web UI for the master
> at
> > port 8080. The port can be changed either in the configuration file or
> via
> > command-line options.
> >
> > "
> > But I cannot open the web ui. The master ip is 10.1.255.202 so I input:
> > http://10.1.255.202 :8080 in my web browser. Bui the webpage is not
> > available.
> > Thanks.
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-Access-Web-UI-tp3599.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
Do you get the same problem if you build with maven?


On Tue, Apr 1, 2014 at 12:23 PM, Vipul Pandey  wrote:

> SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly
>
> That's all I do.
>
> On Apr 1, 2014, at 11:41 AM, Patrick Wendell  wrote:
>
> Vidal - could you show exactly what flags/commands you are using when you
> build spark to produce this assembly?
>
>
> On Tue, Apr 1, 2014 at 12:53 AM, Vipul Pandey  wrote:
>
>> Spark now shades its own protobuf dependency so protobuf 2.4.1 should't
>> be getting pulled in unless you are directly using akka yourself. Are you?
>>
>> No i'm not. Although I see that protobuf libraries are directly pulled
>> into the 0.9.0 assembly jar - I do see the shaded version as well.
>> e.g. below for Message.class
>>
>> -bash-4.1$ jar -ftv
>> ./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.0-cdh4.2.1.jar
>> | grep protobuf | grep /Message.class
>>478 Thu Jun 30 15:26:12 PDT 2011 com/google/protobuf/Message.class
>>508 Sat Dec 14 14:20:38 PST 2013
>> com/google/protobuf_spark/Message.class
>>
>>
>> Does your project have other dependencies that might be indirectly
>> pulling in protobuf 2.4.1? It would be helpful if you could list all of
>> your dependencies including the exact Spark version and other libraries.
>>
>> I did have another one which I moved to the end of classpath - even ran
>> partial code without that dependency but it still failed whenever I use the
>> jar with ScalaBuf dependency.
>> Spark version is 0.9.0
>>
>>
>> ~Vipul
>>
>> On Mar 31, 2014, at 4:51 PM, Patrick Wendell  wrote:
>>
>> Spark now shades its own protobuf dependency so protobuf 2.4.1 should't
>> be getting pulled in unless you are directly using akka yourself. Are you?
>>
>> Does your project have other dependencies that might be indirectly
>> pulling in protobuf 2.4.1? It would be helpful if you could list all of
>> your dependencies including the exact Spark version and other libraries.
>>
>> - Patrick
>>
>>
>> On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey wrote:
>>
>>> I'm using ScalaBuff (which depends on protobuf2.5) and facing the same
>>> issue. any word on this one?
>>> On Mar 27, 2014, at 6:41 PM, Kanwaldeep  wrote:
>>>
>>> > We are using Protocol Buffer 2.5 to send messages to Spark Streaming
>>> 0.9 with
>>> > Kafka stream setup. I have protocol Buffer 2.5 part of the uber jar
>>> deployed
>>> > on each of the spark worker nodes.
>>> > The message is compiled using 2.5 but then on runtime it is being
>>> > de-serialized by 2.4.1 as I'm getting the following exception
>>> >
>>> > java.lang.VerifyError (java.lang.VerifyError: class
>>> > com.snc.sinet.messages.XServerMessage$XServer overrides final method
>>> > getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;)
>>> > java.lang.ClassLoader.defineClass1(Native Method)
>>> > java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
>>> > java.lang.ClassLoader.defineClass(ClassLoader.java:615)
>>> > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
>>> >
>>> > Suggestions on how I could still use ProtoBuf 2.5. Based on the
>>> article -
>>> > https://spark-project.atlassian.net/browse/SPARK-995 we should be
>>> able to
>>> > use different version of protobuf in the application.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396.html
>>> > Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com .
>>>
>>>
>>
>>
>
>


Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Aaron Davidson
Hm, yeah, the docs are not clear on this one. The function you're looking
for to change the number of partitions on any ol' RDD is "repartition()",
which is available in master but for some reason doesn't seem to show up in
the latest docs. Sorry about that, I also didn't realize partitionBy() had
this behavior from reading the Python docs (though it is consistent with
the Scala API, just more type-safe there).


On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas  wrote:

> Just an FYI, it's not obvious from the 
> docsthat
>  the following code should fail:
>
> a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
> a._jrdd.splits().size()
> a.count()
> b = a.partitionBy(5)
> b._jrdd.splits().size()
> b.count()
>
> I figured out from the example that if I generated a key by doing this
>
> b = a.map(lambda x: (x, x)).partitionBy(5)
>
>  then all would be well.
>
> In other words, partitionBy() only works on RDDs of tuples. Is that
> correct?
>
> Nick
>
>
> --
> View this message in context: PySpark RDD.partitionBy() requires an RDD
> of 
> tuples
> Sent from the Apache Spark User List mailing list 
> archiveat Nabble.com.
>


Re: Best practices: Parallelized write to / read from S3

2014-04-01 Thread Nicholas Chammas
Alright!

Thanks for that link. I did a little research based on it, and it looks like
Snappy or LZO + some container would be better alternatives to gzip.

I confirmed that gzip was cramping my style by trying sc.textFile() on an
uncompressed version of the text file. With the uncompressed file, setting
minSplits gives me an RDD that is partitioned as expected. This makes all
my subsequent operations, obviously, much faster.

I dunno if it would be appropriate to have Spark issue some kind of warning
that "Hey, your file is compressed using gzip so..."

Anyway, mystery solved!

Nick


On Tue, Apr 1, 2014 at 5:03 PM, Aaron Davidson  wrote:

> Looks like you're right that gzip files are not easily splittable [1], and
> also about everything else you said.
>
> [1]
> http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCANDWdjY2hN-=jXTSNZ8JHZ=G-S+ZKLNze=rgkjacjaw3tto...@mail.gmail.com%3E
>
>
>
>
> On Tue, Apr 1, 2014 at 1:51 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Alright, so I've upped the minSplits parameter on my call to textFile,
>> but the resulting RDD still has only 1 partition, which I assume means it
>> was read in on a single process. I am checking the number of partitions in
>> pyspark by using the rdd._jrdd.splits().size() trick I picked up on this
>> list.
>>
>> The source file is a gzipped text file. I have heard things about gzipped
>> files not being splittable.
>>
>> Is this the reason that opening the file with minSplits = 10 still gives
>> me an RDD with one partition? If so, I guess the only way to speed up the
>> load would be to change the source file's format to something splittable.
>>
>> Otherwise, if I want to speed up subsequent computation on the RDD, I
>> should explicitly partition it with a call to RDD.partitionBy(10).
>>
>> Is that correct?
>>
>>
>> On Mon, Mar 31, 2014 at 1:15 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> OK sweet. Thanks for walking me through that.
>>>
>>> I wish this were StackOverflow so I could bestow some nice rep on all
>>> you helpful people.
>>>
>>>
>>> On Mon, Mar 31, 2014 at 1:06 PM, Aaron Davidson wrote:
>>>
 Note that you may have minSplits set to more than the number of cores
 in the cluster, and Spark will just run as many as possible at a time. This
 is better if certain nodes may be slow, for instance.

 In general, it is not necessarily the case that doubling the number of
 cores doing IO will double the throughput, because you could be saturating
 the throughput with fewer cores. However, S3 is odd in that each connection
 gets way less bandwidth than your network link can provide, and it does
 seem to scale linearly with the number of connections. So, yes, taking
 minSplits up to 4 (or higher) will likely result in a 2x performance
 improvement.

 saveAsTextFile() will use as many partitions (aka splits) as the RDD
 it's being called on. So for instance:

 sc.textFile(myInputFile, 15).map(lambda x: x +
 "!!!").saveAsTextFile(myOutputFile)

 will use 15 partitions to read the text file (i.e., up to 15 cores at a
 time) and then again to save back to S3.



 On Mon, Mar 31, 2014 at 9:46 AM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> So setting minSplits will set the parallelism on the read in
> SparkContext.textFile(), assuming I have the cores in the cluster to deliver
> that level of parallelism. And if I don't explicitly provide it, Spark will
> set the minSplits to 2.
>
> So for example, say I have a cluster with 4 cores total, and it takes
> 40 minutes to read a single file from S3 with minSplits at 2. It should
> take roughly 20 minutes to read the same file if I up minSplits to 4.
>
> Did I understand that correctly?
>
> RDD.saveAsTextFile() doesn't have an analog to minSplits, so I'm
> guessing that's not an operation the user can tune.
>
>
> On Mon, Mar 31, 2014 at 12:29 PM, Aaron Davidson 
> wrote:
>
>> Spark will only use each core for one task at a time, so doing
>>
>> sc.textFile(<your file>, <num reducers>)
>>
>> where you set "num reducers" to at least as many as the total number
>> of cores in your cluster, is about as fast as you can get out of the box.
>> Same goes for saveAsTextFile.
>>
>>
>> On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Howdy-doody,
>>>
>>> I have a single, very large file sitting in S3 that I want to read
>>> in with sc.textFile(). What are the best practices for reading in this 
>>> file
>>> as quickly as possible? How do I parallelize the read as much as 
>>> possible?
>>>
>>> Similarly, say I have a single, very large 

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Nicholas Chammas
Hmm, doing help(rdd) in PySpark doesn't show a method called repartition().
Trying rdd.repartition() or rdd.repartition(10) also fails. I'm on 0.9.0.

The approach I'm going with to partition my MappedRDD is to key it by a
random int, and then partition it.

So something like:

rdd = sc.textFile('s3n://gzipped_file_brah.gz') # rdd has 1 partition;
minSplits is not actionable due to gzip
keyed_rdd = rdd.keyBy(lambda x: randint(1,100)) # we key the RDD so we can
partition it
partitioned_rdd = keyed_rdd.partitionBy(10) # rdd has 10 partitions

Are you saying I don't have to do this?

Nick



On Tue, Apr 1, 2014 at 7:38 PM, Aaron Davidson  wrote:

> Hm, yeah, the docs are not clear on this one. The function you're looking
> for to change the number of partitions on any ol' RDD is "repartition()",
> which is available in master but for some reason doesn't seem to show up in
> the latest docs. Sorry about that, I also didn't realize partitionBy() had
> this behavior from reading the Python docs (though it is consistent with
> the Scala API, just more type-safe there).
>
>
> On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Just an FYI, it's not obvious from the docs that the following code should
>> fail:
>>
>> a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
>> a._jrdd.splits().size()
>> a.count()
>> b = a.partitionBy(5)
>> b._jrdd.splits().size()
>> b.count()
>>
>> I figured out from the example that if I generated a key by doing this
>>
>> b = a.map(lambda x: (x, x)).partitionBy(5)
>>
>>  then all would be well.
>>
>> In other words, partitionBy() only works on RDDs of tuples. Is that
>> correct?
>>
>> Nick
>>
>>
>> --
>> View this message in context: PySpark RDD.partitionBy() requires an RDD of tuples
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Vipul Pandey
How do you recommend building that? It says:
[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:assembly
(default-cli) on project spark-0.9.0-incubating: Error reading assemblies: No
assembly descriptors found. -> [Help 1]
upon running
mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean assembly:assembly


On Apr 1, 2014, at 4:13 PM, Patrick Wendell  wrote:

> Do you get the same problem if you build with maven?
> 
> 
> On Tue, Apr 1, 2014 at 12:23 PM, Vipul Pandey  wrote:
> SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly 
> 
> That's all I do. 
> 
> On Apr 1, 2014, at 11:41 AM, Patrick Wendell  wrote:
> 
>> Vipul - could you show exactly what flags/commands you are using when you 
>> build spark to produce this assembly?
>> 
>> 
>> On Tue, Apr 1, 2014 at 12:53 AM, Vipul Pandey  wrote:
>>> Spark now shades its own protobuf dependency so protobuf 2.4.1 should't be 
>>> getting pulled in unless you are directly using akka yourself. Are you?
>> 
>> No i'm not. Although I see that protobuf libraries are directly pulled into 
>> the 0.9.0 assembly jar - I do see the shaded version as well. 
>> e.g. below for Message.class
>> 
>> -bash-4.1$ jar -ftv 
>> ./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.0-cdh4.2.1.jar
>>  | grep protobuf | grep /Message.class
>>478 Thu Jun 30 15:26:12 PDT 2011 com/google/protobuf/Message.class
>>508 Sat Dec 14 14:20:38 PST 2013 com/google/protobuf_spark/Message.class
>> 
>> 
>>> Does your project have other dependencies that might be indirectly pulling 
>>> in protobuf 2.4.1? It would be helpful if you could list all of your 
>>> dependencies including the exact Spark version and other libraries.
>> 
>> I did have another one which I moved to the end of classpath - even ran 
>> partial code without that dependency but it still failed whenever I use the 
>> jar with ScalaBuf dependency. 
>> Spark version is 0.9.0
>> 
>> 
>> ~Vipul
>> 
>> On Mar 31, 2014, at 4:51 PM, Patrick Wendell  wrote:
>> 
>>> Spark now shades its own protobuf dependency so protobuf 2.4.1 should't be 
>>> getting pulled in unless you are directly using akka yourself. Are you?
>>> 
>>> Does your project have other dependencies that might be indirectly pulling 
>>> in protobuf 2.4.1? It would be helpful if you could list all of your 
>>> dependencies including the exact Spark version and other libraries.
>>> 
>>> - Patrick
>>> 
>>> 
>>> On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey  wrote:
>>> I'm using ScalaBuff (which depends on protobuf2.5) and facing the same 
>>> issue. any word on this one?
>>> On Mar 27, 2014, at 6:41 PM, Kanwaldeep  wrote:
>>> 
>>> > We are using Protocol Buffer 2.5 to send messages to Spark Streaming 0.9 
>>> > with
>>> > Kafka stream setup. I have protocol Buffer 2.5 part of the uber jar 
>>> > deployed
>>> > on each of the spark worker nodes.
>>> > The message is compiled using 2.5 but then on runtime it is being
>>> > de-serialized by 2.4.1 as I'm getting the following exception
>>> >
>>> > java.lang.VerifyError (java.lang.VerifyError: class
>>> > com.snc.sinet.messages.XServerMessage$XServer overrides final method
>>> > getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;)
>>> > java.lang.ClassLoader.defineClass1(Native Method)
>>> > java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
>>> > java.lang.ClassLoader.defineClass(ClassLoader.java:615)
>>> > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
>>> >
>>> > Suggestions on how I could still use ProtoBuf 2.5. Based on the article -
>>> > https://spark-project.atlassian.net/browse/SPARK-995 we should be able to
>>> > use different version of protobuf in the application.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context: 
>>> > http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396.html
>>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> 
>>> 
>> 
>> 
> 
> 



Issue with zip and partitions

2014-04-01 Thread Patrick_Nicolas
Dell - Internal Use - Confidential
I got an exception "Can't zip RDDs with unequal numbers of partitions" when I
apply any action (reduce, collect) on a dataset created by zipping two datasets
of 10 million entries each.  The problem occurs independently of the number of
partitions, and also when I let Spark create those partitions.

Interestingly enough, I do not have this problem when zipping datasets of 1 and
2.5 million entries.
A similar problem was reported on this board with 0.8, but I don't remember if
the problem was fixed.

Any idea? Any workaround?

I appreciate.


Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Mayur Rustagi
You can get detailed information about each stage through the Spark listener
interface. Multiple jobs may be compressed into a single stage, so job-wise
information would be the same as what Spark itself reports.
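
As a rough sketch of such a listener (the class and field names below follow
the newer listener API and may differ slightly between Spark releases, so treat
this as a starting point rather than a drop-in):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

class ProgressLogger extends SparkListener {
  private var tasksDone = 0

  // Called each time a task finishes anywhere in the cluster.
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    tasksDone += 1
    println(s"tasks finished so far: $tasksDone (last one in stage ${taskEnd.stageId})")
  }

  // Called when an entire stage finishes.
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    val info = stage.stageInfo
    println(s"stage ${info.stageId} (${info.name}) completed")
  }
}

// Register it on the SparkContext before running your jobs:
sc.addSparkListener(new ProgressLogger)
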
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Tue, Apr 1, 2014 at 11:18 AM, Kevin Markey wrote:

>  The discussion there hits on the distinction between jobs and stages.  When
> looking at one application, there are hundreds of stages, sometimes
> thousands.  It depends on the data and the task.  And the UI seems to track
> stages, and one could independently track them for such a job.  But what
> if -- as occurs in another application -- there are only one or two stages,
> but lots of data passing through those 1 or 2 stages?
>
> Kevin Markey
>
>
>
> On 04/01/2014 09:55 AM, Mark Hamstra wrote:
>
> Some related discussion: https://github.com/apache/spark/pull/246
>
>
> On Tue, Apr 1, 2014 at 8:43 AM, Philip Ogren wrote:
>
>> Hi DB,
>>
>> Just wondering if you ever got an answer to your question about
>> monitoring progress - either offline or through your own investigation.
>>  Any findings would be appreciated.
>>
>> Thanks,
>> Philip
>>
>>
>> On 01/30/2014 10:32 PM, DB Tsai wrote:
>>
>>> Hi guys,
>>>
>>> When we're running a very long job, we would like to show users the
>>> current progress of the map and reduce jobs. After looking at the API
>>> documentation, I didn't find anything for this. However, in the Spark UI, I
>>> can see the progress of the tasks. Is there anything I missed?
>>>
>>> Thanks.
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> Machine Learning Engineer
>>> Alpine Data Labs
>>> --
>>> Web: http://alpinenow.com/
>>>
>>
>>
>
>


Status of MLI?

2014-04-01 Thread Krakna H
What is the current development status of MLI/MLBase? I see that the github
repo is lying dormant (https://github.com/amplab/MLI) and JIRA has had no
activity in the last 30 days (
https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
Is the plan to add a lot of this into mllib itself without needing a
separate API?

Thanks!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Status of MLI?

2014-04-01 Thread Nan Zhu
mllib has been part of Spark distribution (under mllib directory), also check 
http://spark.apache.org/docs/latest/mllib-guide.html 

and for JIRA, because of the recent migration to apache JIRA, I think all 
mllib-related issues should be under the Spark umbrella, 
https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel
 

-- 
Nan Zhu


On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:

> What is the current development status of MLI/MLBase? I see that the github 
> repo is lying dormant (https://github.com/amplab/MLI) and JIRA has had no 
> activity in the last 30 days 
> (https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
>  Is the plan to add a lot of this into mllib itself without needing a 
> separate API?
> 
> Thanks!
> 
> View this message in context: Status of MLI? 
> (http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610.html)
> Sent from the Apache Spark User List mailing list archive 
> (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com 
> (http://Nabble.com).



Re: Status of MLI?

2014-04-01 Thread Krakna H
Hi Nan,

I was actually referring to MLI/MLBase (http://www.mlbase.org); is this
being actively developed?

I'm familiar with mllib and have been looking at its documentation.

Thanks!


On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] <
ml-node+s1001560n3611...@n3.nabble.com> wrote:

>  mllib has been part of Spark distribution (under mllib directory), also
> check http://spark.apache.org/docs/latest/mllib-guide.html
>
> and for JIRA, because of the recent migration to apache JIRA, I think all
> mllib-related issues should be under the Spark umbrella,
> https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel
>
> --
> Nan Zhu
>
> On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:
>
> What is the current development status of MLI/MLBase? I see that the
> github repo is lying dormant (https://github.com/amplab/MLI) and JIRA has
> had no activity in the last 30 days (
> https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
> Is the plan to add a lot of this into mllib itself without needing a
> separate API?
>
> Thanks!
>
> --
> View this message in context: Status of MLI?
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3611.html
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3612.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Status of MLI?

2014-04-01 Thread Nan Zhu
Ah, I see, I’m sorry, I didn’t read your email carefully   

then I have no idea about the progress on MLBase

Best,  

--  
Nan Zhu


On Tuesday, April 1, 2014 at 11:05 PM, Krakna H wrote:

> Hi Nan,
>  
> I was actually referring to MLI/MLBase (http://www.mlbase.org); is this being 
> actively developed?
>  
> I'm familiar with mllib and have been looking at its documentation.
>  
> Thanks!
>  
>  
> On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] 
> <[hidden email] (/user/SendEmail.jtp?type=node&node=3612&i=0)> wrote:
> > mllib has been part of Spark distribution (under mllib directory), also 
> > check http://spark.apache.org/docs/latest/mllib-guide.html  
> >  
> > and for JIRA, because of the recent migration to apache JIRA, I think all 
> > mllib-related issues should be under the Spark umbrella, 
> > https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel
> >   
> >  
> > --  
> > Nan Zhu
> >  
> >  
> > On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:
> >  
> >  
> > > What is the current development status of MLI/MLBase? I see that the 
> > > github repo is lying dormant (https://github.com/amplab/MLI) and JIRA has 
> > > had no activity in the last 30 days 
> > > (https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
> > >  Is the plan to add a lot of this into mllib itself without needing a 
> > > separate API?
> > >  
> > > Thanks!
> > >  
> > > View this message in context: Status of MLI? 
> > > (http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610.html)
> > > Sent from the Apache Spark User List mailing list archive 
> > > (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com 
> > > (http://Nabble.com).
> >  
> >  
> >  
> > If you reply to this email, your message will be added to the discussion 
> > below: 
> > http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3611.html
> >   
>  
> View this message in context: Re: Status of MLI? 
> (http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3612.html)
> Sent from the Apache Spark User List mailing list archive 
> (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com 
> (http://Nabble.com).



Re: How to index each map operation????

2014-04-01 Thread Thierry Herrmann
I'm new to Spark, but isn't this a pure scala question ?

The following seems to work with the spark shell:

$ spark-shell

scala> val rdd = sc.makeRDD(List(10,20,30))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at makeRDD at <console>:12

scala> var cnt = -1
cnt: Int = -1

scala> val rdd2 = rdd.map(i => {cnt+=1;  (cnt,i)} )
rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = MappedRDD[9] at map at <console>:16

scala> rdd2.collect
res8: Array[(Int, Int)] = Array((0,10), (1,20), (2,30))

Thierry



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-index-each-map-operation-tp3471p3614.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Status of MLI?

2014-04-01 Thread Evan R. Sparks
Hi there,

MLlib is the first component of MLbase - MLI and the higher levels of the
stack are still being developed. Look for updates in terms of our progress
on the hyperparameter tuning/model selection problem in the next month or
so!

- Evan


On Tue, Apr 1, 2014 at 8:05 PM, Krakna H  wrote:

> Hi Nan,
>
> I was actually referring to MLI/MLBase (http://www.mlbase.org); is this
> being actively developed?
>
> I'm familiar with mllib and have been looking at its documentation.
>
> Thanks!
>
>
> On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] <[hidden
> email] > wrote:
>
>>  mllib has been part of Spark distribution (under mllib directory), also
>> check http://spark.apache.org/docs/latest/mllib-guide.html
>>
>> and for JIRA, because of the recent migration to apache JIRA, I think all
>> mllib-related issues should be under the Spark umbrella,
>> https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel
>>
>> --
>> Nan Zhu
>>
>> On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:
>>
>> What is the current development status of MLI/MLBase? I see that the
>> github repo is lying dormant (https://github.com/amplab/MLI) and JIRA
>> has had no activity in the last 30 days (
>> https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
>> Is the plan to add a lot of this into mllib itself without needing a
>> separate API?
>>
>> Thanks!
>>
>> --
>> View this message in context: Status of MLI?
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>>
>> --
>>  If you reply to this email, your message will be added to the
>> discussion below:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3611.html
>>
>
>
> --
> View this message in context: Re: Status of MLI?
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Issue with zip and partitions

2014-04-01 Thread Xiangrui Meng
From the API docs: "Zips this RDD with another one, returning key-value
pairs with the first element in each RDD, second element in each RDD,
etc. Assumes that the two RDDs have the *same number of partitions*
and the *same number of elements in each partition* (e.g. one was made
through a map on the other)."

Basically, one RDD should be a mapped RDD of the other, or both RDDs
are mapped RDDs of the same RDD.
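
For what it's worth, here is a minimal sketch of the safe pattern (assuming
only a SparkContext named sc; the data and sizes are made up):

val a = sc.parallelize(1 to 1000000, 8)  // 8 partitions
val b = a.map(x => x * 2.0)              // a mapped view of a: same partition
                                         // boundaries, same count per partition
val zipped = a.zip(b)                    // safe by construction
zipped.count()

// Two RDDs built independently (parallelized or loaded separately) can end up
// with different element counts per partition even when the number of
// partitions matches, and then zip fails once an action runs.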

Btw, your message says "Dell - Internal Use - Confidential"...

Best,
Xiangrui

On Tue, Apr 1, 2014 at 7:27 PM,   wrote:
> Dell - Internal Use - Confidential
>
> I got an exception "Can't zip RDDs with unequal numbers of partitions" when
> I apply any action (reduce, collect) on a dataset created by zipping two
> datasets of 10 million entries each.  The problem occurs independently of the
> number of partitions, and also when I let Spark create those partitions.
>
>
>
> Interestingly enough, I do not have this problem when zipping datasets of 1
> and 2.5 million entries.
>
> A similar problem was reported on this board with 0.8, but I don't remember
> if the problem was fixed.
>
>
>
> Any idea? Any workaround?
>
>
>
> I appreciate.


Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
It's this: mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean package
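
If the 2.4.1 classes are coming in through your own project rather than the
Spark assembly, a plain Maven dependency report (nothing Spark-specific) should
show which artifact pulls them in, e.g.:

mvn dependency:tree -Dincludes=com.google.protobuf

For sbt builds, the sbt-dependency-graph plugin gives a similar view.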


On Tue, Apr 1, 2014 at 5:15 PM, Vipul Pandey  wrote:

> How do you recommend building that? It says:
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:assembly
> (default-cli) on project spark-0.9.0-incubating: Error reading assemblies:
> No assembly descriptors found. -> [Help 1]
> upon running
> mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean assembly:assembly
>
>
> On Apr 1, 2014, at 4:13 PM, Patrick Wendell  wrote:
>
> Do you get the same problem if you build with maven?
>
>
> On Tue, Apr 1, 2014 at 12:23 PM, Vipul Pandey  wrote:
>
>> SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly
>>
>> That's all I do.
>>
>> On Apr 1, 2014, at 11:41 AM, Patrick Wendell  wrote:
>>
>> Vipul - could you show exactly what flags/commands you are using when you
>> build spark to produce this assembly?
>>
>>
>> On Tue, Apr 1, 2014 at 12:53 AM, Vipul Pandey  wrote:
>>
>>> Spark now shades its own protobuf dependency so protobuf 2.4.1 should't
>>> be getting pulled in unless you are directly using akka yourself. Are you?
>>>
>>> No i'm not. Although I see that protobuf libraries are directly pulled
>>> into the 0.9.0 assembly jar - I do see the shaded version as well.
>>> e.g. below for Message.class
>>>
>>> -bash-4.1$ jar -ftv
>>> ./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.0-cdh4.2.1.jar
>>> | grep protobuf | grep /Message.class
>>>478 Thu Jun 30 15:26:12 PDT 2011 com/google/protobuf/Message.class
>>>508 Sat Dec 14 14:20:38 PST 2013
>>> com/google/protobuf_spark/Message.class
>>>
>>>
>>> Does your project have other dependencies that might be indirectly
>>> pulling in protobuf 2.4.1? It would be helpful if you could list all of
>>> your dependencies including the exact Spark version and other libraries.
>>>
>>> I did have another one which I moved to the end of classpath - even ran
>>> partial code without that dependency but it still failed whenever I use the
>>> jar with ScalaBuf dependency.
>>> Spark version is 0.9.0
>>>
>>>
>>> ~Vipul
>>>
>>> On Mar 31, 2014, at 4:51 PM, Patrick Wendell  wrote:
>>>
>>> Spark now shades its own protobuf dependency so protobuf 2.4.1 should't
>>> be getting pulled in unless you are directly using akka yourself. Are you?
>>>
>>> Does your project have other dependencies that might be indirectly
>>> pulling in protobuf 2.4.1? It would be helpful if you could list all of
>>> your dependencies including the exact Spark version and other libraries.
>>>
>>> - Patrick
>>>
>>>
>>> On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey wrote:
>>>
 I'm using ScalaBuff (which depends on protobuf2.5) and facing the same
 issue. any word on this one?
 On Mar 27, 2014, at 6:41 PM, Kanwaldeep  wrote:

 > We are using Protocol Buffer 2.5 to send messages to Spark Streaming
 0.9 with
 > Kafka stream setup. I have protocol Buffer 2.5 part of the uber jar
 deployed
 > on each of the spark worker nodes.
 > The message is compiled using 2.5 but then on runtime it is being
 > de-serialized by 2.4.1 as I'm getting the following exception
 >
 > java.lang.VerifyError (java.lang.VerifyError: class
 > com.snc.sinet.messages.XServerMessage$XServer overrides final method
 > getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;)
 > java.lang.ClassLoader.defineClass1(Native Method)
 > java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
 > java.lang.ClassLoader.defineClass(ClassLoader.java:615)
 >
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
 >
 > Suggestions on how I could still use ProtoBuf 2.5. Based on the
 article -
 > https://spark-project.atlassian.net/browse/SPARK-995 we should be
 able to
 > use different version of protobuf in the application.
 >
 >
 >
 >
 >
 > --
 > View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396.html
 > Sent from the Apache Spark User List mailing list archive at
 Nabble.com .


>>>
>>>
>>
>>
>
>


Re: How to index each map operation????

2014-04-01 Thread Shixiong Zhu
Hi Thierry,

Your code does not work if @yh18190 wants a global counter. An RDD may have
more than one partition, and for each partition cnt will be reset to -1. You
can try the following code:

scala> val rdd = sc.parallelize( (1, 'a') :: (2, 'b') :: (3, 'c') :: (4,
'd') :: Nil)
rdd: org.apache.spark.rdd.RDD[(Int, Char)] = ParallelCollectionRDD[3] at parallelize at <console>:12

scala> import org.apache.spark.HashPartitioner
import org.apache.spark.HashPartitioner

scala> val rdd2 = rdd.partitionBy(new HashPartitioner(2))
rdd2: org.apache.spark.rdd.RDD[(Int, Char)] = ShuffledRDD[4] at partitionBy at <console>:18

scala> var cnt = -1
cnt: Int = -1

scala> val rdd3 = rdd2.map(i => {cnt+=1;  (cnt,i)} )
rdd3: org.apache.spark.rdd.RDD[(Int, (Int, Char))] = MappedRDD[5] at map at <console>:22

scala> rdd3.collect
res2: Array[(Int, (Int, Char))] = Array((0,(2,b)), (1,(4,d)), (0,(1,a)),
(1,(3,c)))

A proper solution is using "rdd.partitionBy(new HashPartitioner(1))" to
make sure there is only one partition. But that's not efficient for big
input.
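
If a single partition is too restrictive, another option (just a sketch,
assuming mapPartitionsWithIndex is available in your Spark version; the val
names are made up) is a two-pass approach on rdd2: count the elements in each
partition first, then number each partition starting at its cumulative offset:

// Pass 1: how many elements each partition of rdd2 holds.
val counts = rdd2
  .mapPartitionsWithIndex((i, it) => Iterator((i, it.size.toLong)))
  .collect()
  .sortBy(_._1)
  .map(_._2)

// offsets(i) = number of elements in partitions 0 .. i-1
val offsets = counts.scanLeft(0L)(_ + _)

// Pass 2: index within each partition, shifted by that partition's offset.
val globallyIndexed = rdd2.mapPartitionsWithIndex((i, it) =>
  it.zipWithIndex.map { case (x, j) => (offsets(i) + j, x) })

globallyIndexed.collect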

Best Regards,
Shixiong Zhu


2014-04-02 11:10 GMT+08:00 Thierry Herrmann :

> I'm new to Spark, but isn't this a pure scala question ?
>
> The following seems to work with the spark shell:
>
> $ spark-shell
>
> scala> val rdd = sc.makeRDD(List(10,20,30))
> rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at makeRDD at <console>:12
>
> scala> var cnt = -1
> cnt: Int = -1
>
> scala> val rdd2 = rdd.map(i => {cnt+=1;  (cnt,i)} )
> rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = MappedRDD[9] at map at <console>:16
>
> scala> rdd2.collect
> res8: Array[(Int, Int)] = Array((0,10), (1,20), (2,30))
>
> Thierry
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-index-each-map-operation-tp3471p3614.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>