Re: Persistent Local Node variables

2014-06-22 Thread Daedalus
Will using mapPartitions and creating a new RDD of ParsedData objects avoid
multiple parsing?
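Something like this is what I have in mind (a rough sketch; "lines" stands for
the JavaRDD<String> of raw text, and I'm assuming ParsedData can be built from
an Iterator<String>):

import java.util.Collections;
import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

JavaRDD<ParsedData> parsed = lines.mapPartitions(
        new FlatMapFunction<Iterator<String>, ParsedData>() {
            @Override
            public Iterable<ParsedData> call(Iterator<String> it) {
                // parse the whole partition once, emit one ParsedData per partition
                return Collections.singletonList(new ParsedData(it));
            }
        });
parsed.cache(); // later actions reuse the parsed objects instead of re-parsing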





Persistent Local Node variables

2014-06-22 Thread Daedalus
*TL;DR:* I want to run a pre-processing step on the data from each partition
(such as parsing) and retain the parsed object on each node for future
processing calls to avoid repeated parsing.

More detail:

I have a server and two nodes in my cluster, with the data partitioned over
HDFS. I am trying to use Spark to process the data and send back results.

The data is available as text; I would like to parse this text first and then
run further processing on the parsed form. To do this, I call a simple
foreachPartition:
rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
    @Override
    public void call(Iterator<String> i) {
        ParsedData p = new ParsedData(i);
    }
});

I would like to retain this ParsedData object on each node for future
processing calls, so as to avoid parsing all over again. So in my next call,
I'd like to do something like this:

rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
    @Override
    public void call(Iterator<String> i) {
        // refer to the previously created ParsedData object
        p.process();
        // accumulate some results
    }
});





Re: Repeated Broadcasts

2014-06-21 Thread Daedalus
Has anyone used this sort of construct? (Read: bump)






Repeated Broadcasts

2014-06-19 Thread Daedalus
I'm trying to use Spark (Java) for an optimization algorithm that needs
repeated server-node exchanges of information (the ADMM algorithm, for those
familiar with it). In each iteration, I need to update a set of values on the
nodes and collect them on the server, which then updates its own set of
values and passes this to ALL nodes.

Say each node optimizes a variable X = {x1, x2, x3, ...},
while the server optimizes a variable Z = {z1, z2, z3, ...}.

I am currently using an Accumulable object to collect the updated X's from
each node into an array maintained on the server. 
Each node requires a copy of Z to optimize X, and this value of Z will
change on every iteration during optimization.

So, is there any computational advantage to broadcasting Z at each iteration
over simply passing it as a parameter to each node? (Remember, Z changes on
each iteration.)

That is, which of the following snippets should I be implementing:

for (int i = 0; i < numIterations; i++) {
    // re-broadcast the freshly updated Z (double[] here for concreteness)
    final Broadcast<double[]> broadVar = sc.broadcast(Z);
    data.foreach(new VoidFunction<Data>() {
        public void call(Data d) {
            X = d.optimize(broadVar.value());
            accum.add(X);
        }
    });

    Z = optimize_Z(accum);
}

*OR*

for (int i = 0; i < numIterations; i++) {
    data.foreach(new VoidFunction<Data>() {
        public void call(Data d) {
            // Z is captured in the closure and shipped with each task
            X = d.optimize(Z);
            accum.add(X);
        }
    });

    Z = optimize_Z(accum);
}






Re: Serialization problem in Spark

2014-06-19 Thread Daedalus
I'm not sure if this is a Hadoop-centric issue or not. I had similar issues
with non-serializable external library classes.

I used a Kryo config (as illustrated in the Spark tuning guide) and
registered the one troublesome class. It seemed to work after that.

Here's the thread I asked on:
http://apache-spark-user-list.1001560.n3.nabble.com/Un-serializable-3rd-party-classes-Spark-Java-tp7815.html
Take a look at the other solutions proposed there as well.
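The gist of it, roughly (a sketch -- MyRegistrator and TroublesomeClass are
placeholder names for your own classes):

import com.esotericsoftware.kryo.Kryo;
import org.apache.spark.SparkConf;
import org.apache.spark.serializer.KryoRegistrator;

public class MyRegistrator implements KryoRegistrator {
    @Override
    public void registerClasses(Kryo kryo) {
        // register the one class that Java serialization chokes on
        kryo.register(TroublesomeClass.class);
    }
}

// ...and when configuring the context:
SparkConf conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "com.example.MyRegistrator"); // fully qualified name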






Re: Un-serializable 3rd-party classes (Spark, Java)

2014-06-18 Thread Daedalus
Kryo did the job.
Thanks!
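
(For reference, the change amounted to roughly this -- a sketch:)

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
JavaSparkContext sc = new JavaSparkContext(conf);
// plus a Kryo registration of the troublesome class, if it needs the hint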


On Wed, Jun 18, 2014 at 10:44 AM, Matei Zaharia [via Apache Spark User
List]  wrote:

> There are a few options:
>
> - Kryo might be able to serialize these objects out of the box, depending
> what’s inside them. Try turning it on as described at
> http://spark.apache.org/docs/latest/tuning.html.
>
> - If that doesn’t work, you can create your own “wrapper” objects that
> implement Serializable, or even a subclass of FlexCompRowMatrix. No need to
> change the original library.
>
> - If the library has its own serialization functions, you could also use
> those inside a wrapper object. Take a look at
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SerializableWritable.scala
>  for
> an example where we make Hadoop’s Writables serializable.
>
> Matei
>
> On Jun 17, 2014, at 10:11 PM, Daedalus <[hidden email]> wrote:
>
> > [...]





Un-serializable 3rd-party classes (Spark, Java)

2014-06-17 Thread Daedalus
I'm trying to use matrix-toolkit-java
<https://github.com/fommil/matrix-toolkits-java/> for an application of
mine, particularly the FlexCompRowMatrix class (used to store sparse
matrices).

I have a class Dataframe -- which contains an int array, two double values,
and one FlexCompRowMatrix.
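
(Roughly like this -- the field names are made up:)

import java.io.Serializable;
import no.uib.cipr.matrix.sparse.FlexCompRowMatrix;

public class Dataframe implements Serializable {
    int[] indices;
    double a, b;
    FlexCompRowMatrix matrix; // not Serializable -- the field Spark trips over
}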

When I try to run a simple Spark .foreach() on a JavaRDD created from a
list of the above-mentioned Dataframes, I get the following error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: *Task not serializable: java.io.NotSerializableException:
no.uib.cipr.matrix.sparse.FlexCompRowMatrix*
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)

The FlexCompRowMatrix doesn't seem to implement Serializable. This class
suits my purpose very well, and I would prefer not to switch over from it.

Other than writing code to make the class serializable and then recompiling
the matrix-toolkit-java source, what options do I have?

Is there any workaround for this issue?


