Re: MLib : Non Linear Optimization

2016-09-07 Thread nsareen
Any answer to this question, group?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLib-Non-Linear-Optimization-tp27645p27676.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



MLib : Non Linear Optimization

2016-09-01 Thread nsareen
I'm part of a Predictive Analytics marketing platform. We do a lot of
optimization (non-linear), currently using SAS / Lindo routines. I was
going through Spark's MLlib documentation & found it supports linear
optimization, and was wondering if it also supports non-linear optimization &, if
not, whether there are any plans to implement it in Spark? We really want to move
away from SAS, since it is a very expensive solution & does not work at a
distributed scale. We want a solution which provides scalability & at the
same time provides accurate results.
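
For context, a minimal sketch of the pattern we are after (the objective, data and class names below are hypothetical, not an MLlib API): Spark distributes the evaluation of a non-linear objective's gradient over the data set, while a plain iterative update runs on the driver.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical example: fit y ~ exp(a * x) by gradient descent; Spark only
// distributes the gradient evaluation over a toy data set.
public class NonLinearSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("nonlinear-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Toy data: rows of (x, y) generated from y = exp(0.5 * x).
        List<double[]> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            double x = i / 500.0;
            rows.add(new double[]{x, Math.exp(0.5 * x)});
        }
        JavaRDD<double[]> data = sc.parallelize(rows).cache();
        long n = data.count();

        double a = 0.0;        // parameter to fit
        double lr = 0.05;      // learning rate
        for (int iter = 0; iter < 200; iter++) {
            final double current = a;
            // Distributed sum of d/da of 0.5 * (exp(a*x) - y)^2 over all rows.
            double grad = data.map(p -> {
                double x = p[0], y = p[1];
                double pred = Math.exp(current * x);
                return (pred - y) * pred * x;
            }).reduce(Double::sum);
            a -= lr * grad / n;
        }
        System.out.println("fitted a = " + a);
        sc.stop();
    }
}

This hand-rolled loop is obviously not a substitute for a real non-linear solver; the question is whether MLlib provides, or plans to provide, one.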



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLib-Non-Linear-Optimization-tp27645.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



input size too large | Performance issues with Spark

2015-03-28 Thread nsareen
Hi All,

I'm facing performance issues with my Spark implementation, and while briefly
investigating the WebUI logs, I noticed that my RDD size is 55 GB, the
Shuffle Write is 10 GB & the Input Size is 200 GB. The application is a web
application which does predictive analytics, so we keep most of our data in
memory. This observation was for only 30 minutes of usage of the application by a
single user. We anticipate at least 10-15 users of the application sending
requests in parallel, which makes me a bit nervous.

One constraint we have is that we do not have too many nodes in the cluster;
we may end up with 3-4 machines at best, but they can be scaled up
vertically, each having 24 cores / 512 GB RAM etc., which would allow us to run
a virtual 10-15 node cluster.

Even then the input size & shuffle write are too high for my liking. Any
suggestions in this regard will be greatly appreciated, as there aren't many
resources on the net for handling performance issues such as these.

Some pointers on my application's data structures & design:

1) The RDD is a JavaPairRDD, with the Key being a custom POJO containing 3-4
HashMaps & the Value containing 1 HashMap.
2) Data is loaded via JdbcRDD during application startup, which also tends
to take a lot of time, since we massage the data once it is fetched from the DB
and then save it as a JavaPairRDD.
3) Most of the data is structured, but we are still using JavaPairRDD, and have
not explored the option of Spark SQL yet.
4) We have only one SparkContext, which caters to all the requests coming
into the application from various users.
5) During a single user session, the user can send 3-4 parallel stages consisting
of Map / Group By / Join / Reduce etc.
6) We have to change the RDD structure using different types of group-by
operations, since the user can drill down / drill up the data
(aggregation at a higher / lower level). This is where we make use of
groupBy's, but there is a cost associated with this.
7) We have observed that the initial RDDs we create have 40-odd
partitions, but after some stage executions like groupBy's the partitions
increase to 200 or so; this was odd, and we haven't figured out why this
happens.

In summary, we want to use Spark to give us the capability to process our
in-memory data structures very fast, as well as to scale to larger volumes when
required in the future.
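
Regarding items 6 and 7 above, a minimal sketch of what we are experimenting with (the type and method names are placeholders for our POJOs): pinning the shuffle partition count explicitly and preferring reduceByKey over groupBy where the downstream work is an aggregation, so values get combined map-side before the shuffle.

import org.apache.spark.api.java.JavaPairRDD;

public class AggregationSketch {
    // Placeholder key type: in the real app the key is the custom POJO above.
    static JavaPairRDD<String, Double> rollUp(JavaPairRDD<String, Double> cells,
                                              int numPartitions) {
        // Sum per key with an explicit partition count, instead of letting the
        // default partitioner decide (one suspect for the jump to ~200 partitions).
        return cells.reduceByKey((a, b) -> a + b, numPartitions);
    }
}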



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/input-size-too-large-Performance-issues-with-Spark-tp22270.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does filter on an RDD scan every data item ?

2014-12-15 Thread nsareen
Thanks! Shall try it out.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-tp20170p20683.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does filter on an RDD scan every data item ?

2014-12-05 Thread nsareen
Any thoughts on how Spark SQL could help in our scenario?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-tp20170p20465.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does filter on an RDD scan every data item ?

2014-12-04 Thread nsareen
Thanks for the reply!

To be honest, I was expecting Spark to have some sort of indexing for keys,
which would help it locate the keys efficiently.

I wasn't using Spark SQL here, but if it helps perform this efficiently, I
can try it out. Can you please elaborate on how it would be helpful in this
scenario?

Thanks,
Nitin.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-tp20170p20365.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does filter on an RDD scan every data item ?

2014-12-04 Thread nsareen
I'm not sure sample is what I was looking for.

As mentioned in another post above, this is what I'm looking for:

1) My RDD contains this structure: Tuple2<CustomTuple, Double>.
2) Each CustomTuple is a combination of string IDs, e.g.
CustomTuple.dimensionOne=AE232323
CustomTuple.dimensionTwo=BE232323
CustomTuple.dimensionThree=CE232323
and so on.
3) CustomTuple has overridden equals & hashCode implementations, which
identify unique objects; two distinct objects are equal if the values in
dimensionOne, Two and Three match.
4) Double is a numeric value.
5) I want to create an RDD of 50-100 million or more such tuples in Spark,
which can grow over time.
6) My web application would request to process a subset of these millions of
rows. The processing is nothing but aggregation / arithmetic functions over
this data set. We felt Spark would be the right candidate to process this in a
distributed fashion and would also help scalability in the future. Where we are
stuck is that, in case the application requests a subset comprising
100 thousand tuples, we would have to construct that many CustomTuple
objects and pass them via the Spark driver program to the filter function, which
in turn would go and scan the 100 million rows to generate the subset.

I was under the assumption that, since Spark allows Key / Value storage, there
would be some indexing of the stored keys, which would help Spark locate
objects.
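
A minimal sketch of the alternative we would hope for (the String key here is our own simplification of CustomTuple): with an explicit partitioner in place, lookup() only has to scan the single partition the key hashes to, rather than all 100 million rows.

import java.util.List;
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

public class KeyedLookupSketch {
    // Partition once by key and keep the partitioned copy in memory.
    static JavaPairRDD<String, Double> index(JavaPairRDD<String, Double> cells) {
        return cells.partitionBy(new HashPartitioner(200))  // partition count is a guess
                    .cache();
    }

    // With a known partitioner, lookup() only touches the one matching partition.
    static List<Double> fetch(JavaPairRDD<String, Double> indexed, String flattenedKey) {
        return indexed.lookup(flattenedKey);
    }
}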






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-tp20170p20366.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Does filter on an RDD scan every data item ?

2014-12-02 Thread nsareen
Hi ,

I wanted some clarity on the functioning of the filter function on RDDs.

1) Does the filter function scan every element saved in the RDD? If my RDD
represents 10 million rows, and I want to work on only 1000 of them, is
there an efficient way of filtering the subset without having to scan every
element?

2) Suppose my RDD represents a Key / Value data set. When I filter this data set
of 10 million rows, can I specify that the search should be restricted to
only the partitions which contain specific keys? Will Spark run my filter
operation on all partitions if the data is partitioned by key, irrespective of
whether the key exists in a partition or not?
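
For reference, a minimal sketch of the plain-filter approach from (1) (the key and set names are hypothetical); as far as I can tell, this touches every partition:

import java.util.HashSet;
import java.util.Set;
import org.apache.spark.api.java.JavaPairRDD;

public class FilterSketch {
    static JavaPairRDD<String, Double> subset(JavaPairRDD<String, Double> cells,
                                              Set<String> wantedKeys) {
        // Copy into a serializable set that gets shipped with the task closure.
        final Set<String> wanted = new HashSet<>(wantedKeys);
        // filter() is a narrow transformation, so every partition is scanned.
        return cells.filter(t -> wanted.contains(t._1()));
    }
}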



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-tp20170.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Calling spark from a java web application.

2014-12-02 Thread nsareen
We have a web application which talks to a Spark server.

This is how we have done the integration:
1) In Tomcat's classpath, add the Spark distribution jar so that Spark code
is available at runtime (for you it would be Jetty).
2) In the web application project, add the Spark distribution jar to the
classpath (could be a Java / web project).
3) Set up the FAIR scheduling mode, which helps send parallel requests
from the web application to the Spark cluster.
4) At application startup, initialize the connection to the Spark
cluster. This consists of creating the JavaSparkContext and making it
available throughout the web application, since this is the only
driver program required by the web application.
5) Using the JavaSparkContext, create RDDs and make them available
globally to the web application code.
6) Invoke transformations / actions as required (a sketch of steps 3-5 follows
below).
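
A minimal sketch of steps 3-5, assuming a standalone cluster (the class name, master URL and pool handling are ours, not taken verbatim from our production code):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public final class SparkHolder {
    private static volatile JavaSparkContext sc;

    // Called once at web application startup (e.g. from a ServletContextListener).
    public static synchronized void init() {
        if (sc == null) {
            SparkConf conf = new SparkConf()
                .setAppName("analytics-webapp")
                .setMaster("spark://spark-master:7077")   // assumed cluster URL
                .set("spark.scheduler.mode", "FAIR");     // step 3
            sc = new JavaSparkContext(conf);              // step 4: the single driver context
        }
    }

    // Shared context used by request-handling code to build RDDs (step 5).
    public static JavaSparkContext get() { return sc; }

    public static synchronized void shutdown() {
        if (sc != null) { sc.stop(); sc = null; }
    }
}

Request-handling code can additionally assign jobs to a named fair-scheduler pool via sc.setLocalProperty("spark.scheduler.pool", ...), if finer-grained sharing between users is needed.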

Hopefully this info is of some use.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Calling-spark-from-a-java-web-application-tp20007p20213.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RDD Action require data from Another RDD

2014-11-20 Thread nsareen
Hi,

We have a requirement where we have two data sets represented by RDDs:

RDD A & RDD B.

To perform an aggregation operation on RDD A, the action would need a
subset of RDD B's data. I wanted to understand if there is a best practice for
doing this? I don't even know how this will be possible as of now.
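
One pattern we are weighing (a sketch only; the key type and the filter condition are made up): collect the needed slice of RDD B to the driver, broadcast it, and reference it inside the aggregation over RDD A. This only makes sense if the B-subset is small enough to fit on each executor.

import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastSketch {
    static double aggregate(JavaSparkContext sc,
                            JavaPairRDD<String, Double> rddA,
                            JavaPairRDD<String, Double> rddB) {
        // The "subset" of B: here just the positive entries, as a stand-in condition.
        Map<String, Double> bSubset = rddB.filter(t -> t._2() > 0).collectAsMap();
        Broadcast<Map<String, Double>> bcast = sc.broadcast(bSubset);

        // Aggregation over A that consults the broadcast B-subset per record.
        return rddA.map(t -> t._2() * bcast.value().getOrDefault(t._1(), 1.0))
                   .reduce(Double::sum);
    }
}

If the B-subset is large, a join on a common key would be the other obvious route.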

Help would be much appreciated.

Thanks in Advance.

Nitin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Action-require-data-from-Another-RDD-tp19353.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Efficient Key Structure in pairRDD

2014-11-11 Thread nsareen
Spark devs / users, help in this regard would be appreciated; we are kind of
stuck at this point.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-Key-Structure-in-pairRDD-tp18461p18557.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Efficient Key Structure in pairRDD

2014-11-09 Thread nsareen
Hi,

We are trying to adopt Spark for our application.

We have an analytical application which stores data in star schemas (SQL
Server). All the cubes are loaded into a Key / Value structure and saved in
Trove (an in-memory collection). Here the key is a short array where each short
number represents a dimension member.
E.g. the tuple (CampaignX, Product1, Region_south, 10.23232) gets converted to
Trove Key [[12322],[45232],[53421]] & Value [10.23232].

This is done to avoid saving a collection of string objects as the key in Trove.

Now, can we save this data structure in Spark using a pairRDD? If yes, will
key/value be an ideal way of storing data in Spark and retrieving it for
data analysis, or is there another, better data structure we could create,
which would help us create and process the RDD?
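
A minimal sketch of how the encoded key might carry over to a pair RDD (the class and method names are made up; int codes are used purely for illustration, where the Trove version uses shorts). One caveat we are aware of: a raw primitive array is a poor RDD key because array equals/hashCode are identity-based, so the codes are wrapped in a small value class.

import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public class EncodedKeySketch {
    public static final class CellKey implements Serializable {
        private final int[] codes;   // one encoded code per dimension member
        public CellKey(int[] codes) { this.codes = codes; }
        @Override public boolean equals(Object o) {
            return o instanceof CellKey && Arrays.equals(codes, ((CellKey) o).codes);
        }
        @Override public int hashCode() { return Arrays.hashCode(codes); }
    }

    // Rows arrive as (encoded dimension codes, measure value) pairs.
    static JavaPairRDD<CellKey, Double> toPairs(JavaRDD<Tuple2<int[], Double>> rows) {
        return rows.mapToPair(t -> new Tuple2<>(new CellKey(t._1()), t._2()));
    }
}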

Nitin.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-Key-Structure-in-pairRDD-tp18461.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Task size variation while using Range Vs List

2014-11-06 Thread nsareen
Thanks for the response!! Will try to see the behaviour with cache().



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Task-size-variation-while-using-Range-Vs-List-tp18243p18318.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Task size variation while using Range Vs List

2014-11-05 Thread nsareen
I noticed a behaviour where, if I'm using

val temp = sc.parallelize(1 to 10)

temp.collect

the task size will be in bytes, let's say 1120 bytes.

But if I change this to a for loop:

import scala.collection.mutable.ArrayBuffer
val data = new ArrayBuffer[Integer]()
for (i <- 1 to 100) data += i
val distData = sc.parallelize(data)
distData.collect

here the task size is in MBs: 5000120 bytes.

Any inputs here would be appreciated; this is really confusing.

1) Why does the data travel from the driver to the executor every time an action is
performed? (I thought the data exists in the executor's memory, and only the
code is pushed from driver to executor.)

2) Why does Range not increase the task size, whereas any other collection
increases the size so dramatically?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Task-size-variation-while-using-Range-Vs-List-tp18243.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to trace/debug serialization?

2014-11-05 Thread nsareen
From what I've observed, there are no debug logs while serialization takes
place. You can look at the source code if you want; the TaskSetManager class has
some functions related to serialization.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-trace-debug-serialization-tp18230p18244.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Task Size Increases when using loops

2014-10-29 Thread nsareen
Hi, I'm new to Spark, and am facing a peculiar problem.
I'm writing a simple Java driver program where I'm creating a Key / Value data
structure and collecting it once created. The problem I'm facing is that,
when I increase the iterations of a for loop which creates the ArrayList of
Long values that I have to put into the Key / Value data structure and save
in Spark as a Java collection, the serialized size of the tasks also increases
proportionately.
E.g.: for loop count: 10, task size: 1120 bytes; for loop count: 1, task
size: 33402 bytes; for loop count: 1000, task size: 453434 bytes; etc.
I'm not able to understand why the task size increases. I tried to run the same
example via the Spark shell, and I noticed the task size remains the same,
irrespective of the loop iteration count.
Code:
Code :
@Override
public void execute() {
    // do something
    // Note: num, numPartitions and sc are fields of the enclosing class.
    List<Long> numbers = new ArrayList<Long>();

    JavaRDD<Long> distData = null;

    JavaPairRDD<String, Long> mapOfKeys = null;

    JavaRDD<String> keysRDD = null;

    class ByKeyImpl implements Function<Long, String>, Serializable {
        private static final long serialVersionUID = 5749098182016143296L;

        public String call(Long paramT1) throws Exception {
            // TODO Auto-generated method stub
            StringBuilder builder = new StringBuilder();
            builder.append(paramT1).append(',').append(paramT1 + 1);
            return builder.toString();
        }
    }

    System.out.println("** STARTING BENCHMARK EXAMPLE ...*");

    while (true) {
        System.out.println("** DO YOU WANT TO CONTINUE ? (YES/NO) *");
        BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
        try {
            String continueString = reader.readLine();

            if ("yes".equalsIgnoreCase(continueString)) {

                if (numbers.size() == 0) {
                    // List not populated yet.
                    for (long i = 0; i < num; i++) {
                        numbers.add(i);
                    }
                }
                // At this time numbers has 'num' long values in it.
                // Check whether the RDD has already been created or not.
                if (distData == null) {
                    System.out.println("NEW RDD CREATED.");
                    if (numPartitions > 0) {
                        distData = sc.parallelize(numbers, numPartitions);
                    } else {
                        distData = sc.parallelize(numbers);
                    }
                }
                // At this time the RDD is already present or newly created.
                // Check if the map is null or not.
                if (mapOfKeys == null) {
                    mapOfKeys = distData.keyBy(new ByKeyImpl());

                    keysRDD = mapOfKeys.keys();
                    keysRDD.persist(StorageLevel.MEMORY_ONLY());

Spark Concepts

2014-10-15 Thread nsareen
Hi, I'm pretty new to Big Data & Spark both. I've just started POC work on
Spark, and me & my team are evaluating it against other in-memory computing
tools such as GridGain, BigMemory, Aerospike & some others too, specifically
to solve two sets of problems.

1) Data & Storage: Our current application runs on a single node which is a heavy
configuration of 24 cores & 350 GB. Our application loads all the datamart data,
inclusive of multiple cubes, into memory, converts it and keeps it in a Trove
collection in the form of a Key / Value map. This is an immutable collection
which takes about 15-20 GB of memory space. Our anticipation is that the data
would grow 10-15 fold in the next year or so, & we are not very confident of
Trove being able to scale to that level.

2) Compute: Ours is a natively analytical application doing predictive analytics
with lots of simulations and optimizations of scenarios. At the heart of all this
are the Trove collections, using which we perform our mathematical algorithms to
calculate the end result; in doing so, the memory consumption of the application
goes beyond 250-300 GB. This is because of the many intermediate computing
results (collections) which are further broken down to the granular level and
then searched in the Trove collection. All this happens on a single node, which
obviously starts to perform slowly over a period of time. Based on the large
volume of data incoming in the next year or so, our current architecture will not
be able to handle such a massive in-memory data set & such computing power. Hence
we are targeting to change the architecture to cluster-based, in-memory
distributed computing. We are evaluating all these products along with Apache
Spark. We were very excited by Apache Spark looking at the videos and some online
resources, but when it came down to doing hands-on work we are facing lots of
issues.

1) What are a standalone cluster's limitations? Can I configure a cluster on a
single node with multiple processes of worker nodes, executors etc.? Is this
supported even though the IP address would be the same?

2) Why so many Java processes? Why are there so many Java processes (worker
nodes, executors)? Will the communication between them not slow down the
performance as a whole?

3) How is parallelism on partitioned data achieved? This one is really important
for us to understand, since we are doing our benchmarking on partitioned data.
We do not know how to configure partitions in Spark; any help here would be
appreciated. We want to partition the data present in cubes, hence we want each
cube to be a separate partition.

4) What is the difference between multiple nodes executing jobs & multiple tasks
executing jobs? How do these handle the partitioning & parallelism?

Help with these questions would be really appreciated, to get a better sense of
Apache Spark.

Thanks,
Nitin
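
P.S. On question 3, a minimal sketch of one way partitioning can be controlled explicitly (the cube-id key layout and names below are assumptions of ours): a custom Partitioner routes every row of a cube to the same partition, and partitionBy applies it.

import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;

public class CubePartitionerSketch {
    public static final class CubePartitioner extends Partitioner {
        private final int numCubes;
        public CubePartitioner(int numCubes) { this.numCubes = numCubes; }

        @Override public int numPartitions() { return numCubes; }

        @Override public int getPartition(Object key) {
            // Assumed key format: "<cubeId>|<restOfKey>". Distinct cubes can
            // still share a partition if their ids hash to the same bucket.
            String cubeId = ((String) key).split("\\|", 2)[0];
            return Math.floorMod(cubeId.hashCode(), numCubes);
        }
    }

    static JavaPairRDD<String, Double> byCube(JavaPairRDD<String, Double> cells,
                                              int numCubes) {
        return cells.partitionBy(new CubePartitioner(numCubes)).cache();
    }
}

On question 1, our understanding is that standalone mode does allow several worker processes on one host (e.g. via SPARK_WORKER_INSTANCES in conf/spark-env.sh), but that is exactly the kind of thing we would like confirmed.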



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Concepts-tp16477.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Concepts

2014-10-15 Thread nsareen
Anybody with good hands-on experience with Spark, please do reply. It would help
us a lot!!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Concepts-tp16477p16536.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org