Re: MLlib: Non-Linear Optimization
Any answer to this question from the group?
MLlib: Non-Linear Optimization
I'm part of a team building a predictive-analytics marketing platform. We do a lot of optimization (non-linear), currently using SAS / Lindo routines. I was going through Spark's MLlib documentation and found that it supports linear optimization. I was wondering if it also supports non-linear optimization and, if not, whether there are any plans to implement it in Spark? We really want to move away from SAS, since it is a very expensive solution and does not work at a distributed scale. We want a solution which provides scalability and, at the same time, accurate results.
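For what it's worth, MLlib's optimizers (SGD, L-BFGS) are aimed at convex losses for linear models, so as far as I know there is no general non-linear solver in Spark itself. Breeze, the numerical library MLlib builds on, does ship an L-BFGS implementation for unconstrained non-linear minimization that runs on the driver. A minimal sketch (the Rosenbrock function here is just an illustrative objective, not anything from MLlib):

    import breeze.linalg.DenseVector
    import breeze.optimize.{DiffFunction, LBFGS}

    // Rosenbrock as a stand-in objective: f(x, y) = (1 - x)^2 + 100 (y - x^2)^2
    val f = new DiffFunction[DenseVector[Double]] {
      def calculate(v: DenseVector[Double]): (Double, DenseVector[Double]) = {
        val (x, y) = (v(0), v(1))
        val value = math.pow(1 - x, 2) + 100 * math.pow(y - x * x, 2)
        val grad = DenseVector(
          -2 * (1 - x) - 400 * x * (y - x * x), // df/dx
          200 * (y - x * x)                     // df/dy
        )
        (value, grad)
      }
    }

    val optimizer = new LBFGS[DenseVector[Double]](maxIter = 200, m = 7)
    val solution = optimizer.minimize(f, DenseVector(-1.0, 1.0)) // converges to (1, 1)

Note that this solves the optimization on a single JVM; distributing a general non-linear problem would still be up to the application.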
Input size too large | Performance issues with Spark
Hi All, I'm facing performance issues with my Spark implementation. While briefly investigating the WebUI logs, I noticed that my RDD size is 55 GB, the shuffle write is 10 GB, and the input size is 200 GB. The application is a web application which does predictive analytics, so we keep most of our data in memory. This observation was for only 30 minutes of usage of the application by a single user. We anticipate at least 10-15 users of the application sending requests in parallel, which makes me a bit nervous. One constraint we have is that we do not have too many nodes in a cluster; we may end up with 3-4 machines at best, but they can be scaled up vertically, each having 24 cores / 512 GB RAM etc., which can allow us to make a virtual 10-15 node cluster. Even then, the input size and shuffle write are too high for my liking. Any suggestions in this regard will be greatly appreciated, as there aren't many resources on the net for handling performance issues such as these. Some pointers on my application's data-structure design:
1) The RDD is a JavaPairRDD, with the key a custom POJO containing 3-4 HashMaps and the value containing 1 HashMap.
2) Data is loaded via JdbcRDD during application startup, which also tends to take a lot of time, since we massage the data once it is fetched from the DB and then save it as a JavaPairRDD.
3) Most of the data is structured, but we are still using JavaPairRDD and have not explored the option of Spark SQL.
4) We have only one SparkContext, which caters to all the requests coming into the application from various users.
5) During a single session a user can send 3-4 parallel stages consisting of map / groupBy / join / reduce etc.
6) We have to change the RDD structure using different types of groupBy operations, since the user can drill down and drill up through the data (aggregation at a higher / lower level). This is where we make use of groupBys, but there is a cost associated with them.
7) We have observed that the initial RDDs we create have 40-odd partitions, but after some stage executions like groupBys the partition count increases to 200 or so. This was odd, and we haven't figured out why it happens.
In summary, we want to use Spark to give us the capability to process our in-memory data structures very fast, as well as to scale to larger volumes when required in the future.
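On points 6 and 7, a sketch of one common mitigation, assuming an RDD of (dimensionKey, measure) pairs (the names here are illustrative stand-ins for the JavaPairRDD above). reduceByKey combines values map-side before the shuffle, unlike groupByKey, so it writes far less shuffle data, and passing an explicit partitioner pins the post-shuffle partition count instead of letting it drift to 200:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // metrics: RDD[(String, Double)], an illustrative stand-in for the pair RDD above.
    def rollUp(metrics: RDD[(String, Double)]): RDD[(String, Double)] = {
      // Map-side combine shrinks the shuffle write; the explicit HashPartitioner(40)
      // keeps the output at 40 partitions instead of jumping to 200.
      metrics.reduceByKey(new HashPartitioner(40), _ + _)
    }

The same explicit-partition-count argument exists on groupByKey and join, for aggregations that genuinely need all values per key.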
Re: Does filter on an RDD scan every data item?
Thanks! I shall try it out.
Re: Does filter on an RDD scan every data item?
Any thoughts on how Spark SQL could help in our scenario?
Re: Does filter on an RDD scan every data item?
Thanks for the reply! To be honest, I was expecting Spark to have some sort of indexing for keys, which would help it locate keys efficiently. I wasn't using Spark SQL here, but if it helps perform this efficiently I can try it out. Can you please elaborate on how it would be helpful in this scenario? Thanks, Nitin.
Re: Does filter on an RDD scan every data item?
I'm not sure sample is what I was looking for. As mentioned in another post above, this is what I'm looking for:
1) My RDD contains this structure: Tuple2<CustomTuple, Double>.
2) Each CustomTuple is a combination of string IDs, e.g.
CustomTuple.dimensionOne = AE232323
CustomTuple.dimensionTwo = BE232323
CustomTuple.dimensionThree = CE232323
and so on.
3) CustomTuple has overridden equals / hashCode implementations, which identify unique objects and make two distinct objects equal if their dimensionOne, Two and Three values match.
4) Double is a numeric value.
5) I want to create an RDD of 50-100 million or more such tuples in Spark, which can grow over time.
6) My web application would request to process a subset of these millions of rows. The processing is nothing but aggregation / arithmetic functions over this data set.
We felt Spark would be the right candidate to process this in a distributed fashion, and it would also help scalability in future. Where we are stuck is that, in case the application requests a subset comprising 100 thousand tuples, we would have to construct that many CustomTuple objects and pass them via the Spark driver program to the filter function, which in turn would scan these 100 million rows to generate the subset. I was of the assumption that, since Spark allows key / value storage, there would be some indexing of the stored keys, which would help Spark locate objects.
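There is no key index on a plain RDD, but a partitioner gets part of the way there. A sketch, assuming the dimension fields shown above (a Scala case class stands in for CustomTuple, since it gets equals / hashCode for free): once the pair RDD is hash-partitioned and cached, lookup() consults the partitioner and scans only the single partition that can contain the key, rather than all of them.

    import org.apache.spark.HashPartitioner

    // Stand-in for CustomTuple; equals / hashCode come for free with a case class.
    case class DimKey(dimensionOne: String, dimensionTwo: String, dimensionThree: String)

    // rows: Seq[(DimKey, Double)] is assumed to be loaded elsewhere.
    val facts = sc.parallelize(rows)
      .partitionBy(new HashPartitioner(200)) // equal keys land in known partitions
      .cache()                               // keep the partitioned layout in memory

    // lookup() sees the partitioner and only scans the one matching partition.
    val values: Seq[Double] = facts.lookup(DimKey("AE232323", "BE232323", "CE232323"))

Within the matching partition the scan is still linear, so this helps most when keys are spread over many partitions.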
Does filter on an RDD scan every data item?
Hi, I wanted some clarity on the functioning of the filter function on an RDD.
1) Does the filter function scan every element saved in the RDD? If my RDD represents 10 million rows and I want to work on only 1000 of them, is there an efficient way of filtering out the subset without having to scan every element?
2) If my RDD represents a key / value data set and I filter this data set of 10 million rows, can I specify that the search should be restricted to only the partitions which contain specific keys? Will Spark run my filter operation on all partitions if the partitions are done by key, irrespective of whether the key exists in a partition or not?
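For the second question, a sketch of how partition-level pruning can be expressed, assuming a hash-partitioned pair RDD (PartitionPruningRDD is a developer-level API, and the keys here are illustrative). filter alone visits every partition; with a known partitioner you can compute which partitions could hold the wanted keys and compute only those:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.PartitionPruningRDD

    // rawPairs: RDD[(String, Double)] is assumed to exist.
    val pairs = rawPairs.partitionBy(new HashPartitioner(100)).cache()
    val partitioner = pairs.partitioner.get

    val wantedKeys = Set("AE232323", "BE232323") // illustrative
    val wantedPartitions = wantedKeys.map(k => partitioner.getPartition(k))

    // Only partitions that can contain the wanted keys are scanned; the rest are skipped.
    val pruned = PartitionPruningRDD.create(pairs, wantedPartitions.contains)
    val subset = pruned.filter { case (k, _) => wantedKeys.contains(k) }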
Re: Calling Spark from a Java web application.
We have a web application which talks to a Spark server. This is how we have done the integration:
1) In Tomcat's classpath, add the Spark distribution jar so the Spark code is available at runtime (for you it would be Jetty).
2) In the web application project, add the Spark distribution jar to the classpath (could be a Java / web project).
3) Set up the FAIR scheduling mode, which helps send parallel requests from the web application to the Spark cluster.
4) On application startup, initialize the connection to the Spark cluster. This consists of creating the JavaSparkContext and making it available throughout the web application, in case it needs to be the only driver program required by the web application.
5) Using the JavaSparkContext, create RDDs and make them available globally to the web application code.
6) Invoke transformations / actions as required.
Hopefully this info is of some use.
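A minimal sketch of steps 3 and 4 in Scala (the master URL, app name and pool naming are illustrative, not from the post):

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkHolder {
      // Step 4: one driver-side context, shared by all web requests.
      lazy val sc: SparkContext = {
        val conf = new SparkConf()
          .setMaster("spark://master-host:7077") // illustrative cluster URL
          .setAppName("analytics-webapp")
          .set("spark.scheduler.mode", "FAIR")   // step 3: FAIR scheduling
        new SparkContext(conf)
      }
    }

    // Per request, jobs can be assigned to a named pool so concurrent users
    // share the cluster fairly instead of queueing FIFO:
    // SparkHolder.sc.setLocalProperty("spark.scheduler.pool", "user-" + userId)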
RDD action requires data from another RDD
Hi, we have a requirement where we have two data sets represented by RDDs, RDDA and RDDB. To perform an aggregation operation on RDDA, the action needs a subset of RDDB's data. I wanted to understand if there is a best practice for doing this? I don't even know how this would be possible as of now. Help would be much appreciated. Thanks in advance. Nitin
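One pattern that may fit, assuming the needed RDDB subset is small enough to hold on the driver (the RDD shapes and the predicate below are illustrative): one RDD cannot be referenced inside another RDD's closure, but you can collect the subset and broadcast it into RDDA's aggregation. If the subset is large, a join on a common key is the usual alternative.

    // rddA, rddB: RDD[(String, Double)], assumed shapes.
    val bSubset: Map[String, Double] = rddB
      .filter { case (k, _) => k.startsWith("region_") } // illustrative predicate
      .collectAsMap()
      .toMap

    val bBroadcast = sc.broadcast(bSubset)

    // Each task reads the broadcast copy locally; no nested RDD access is needed.
    val aggregated = rddA
      .map { case (k, v) => (k, v * bBroadcast.value.getOrElse(k, 1.0)) }
      .reduceByKey(_ + _)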
Re: Efficient Key Structure in pairRDD
Spark devs / users, help in this regard would be appreciated; we are kind of stuck at this point.
Efficient Key Structure in pairRDD
Hi, we are trying to adopt Spark for our application. We have an analytical application which stores data in star schemas (SQL Server). All the cubes are loaded into a key / value structure and saved in Trove (an in-memory collection). Here the key is a short array in which each short number represents a dimension member, e.g. the tuple (CampaignX, Product1, Region_south, 10.23232) gets converted to the Trove key [[12322],[45232],[53421]] and value [10.23232]. This is done to avoid saving a collection of String objects as keys in Trove. Now, can we save this data structure in Spark using a pair RDD? If yes, will key / value be an ideal way of storing data in Spark and retrieving it for data analysis, or is there any other, better data structure we could create which would help us build and process the RDD? Nitin.
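Yes, a pair RDD maps naturally onto this. One hedged caveat: a raw short[] makes a poor RDD key, because Java arrays compare by reference, so either keep a wrapper with overridden equals / hashCode or pack the dimension ordinals into a single Long, which keeps keys compact, hash-friendly and cheap to shuffle. A sketch of the packing idea (the 16-bit widths are an assumption matching the short members described above):

    // Pack three dimension ordinals (each assumed to fit in 16 bits) into one Long.
    def packKey(campaign: Int, product: Int, region: Int): Long =
      (campaign.toLong << 32) | (product.toLong << 16) | region.toLong

    def unpackKey(key: Long): (Int, Int, Int) =
      (((key >> 32) & 0xFFFF).toInt, ((key >> 16) & 0xFFFF).toInt, (key & 0xFFFF).toInt)

    // cube: RDD[(Long, Double)], compact keys with built-in equals / hashCode.
    val cube = sc.parallelize(Seq((packKey(12322, 45232, 53421), 10.23232)))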
Re: Task size variation while using Range Vs List
Thanks for the response!! Will try to see the behaviour with cache().
Task size variation while using Range vs List
I noticed the following behaviour. If I'm using

    val temp = sc.parallelize(1 to 10)
    temp.collect

the task size will be in bytes, let's say 1120 bytes. But if I change this to a for loop,

    import scala.collection.mutable.ArrayBuffer
    val data = new ArrayBuffer[Integer]()
    for (i <- 1 to 100) data += i
    val distData = sc.parallelize(data)
    distData.collect

here the task size is in MBs: 5000120 bytes. Any inputs here would be appreciated; this is really confusing.
1) Why does the data travel from the driver to the executor every time an action is performed? (I thought the data exists in the executor's memory, and only the code is pushed from driver to executor.)
2) Why does Range not increase the task size, whereas any other collection increases the size so dramatically?
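A hedged explanation: sc.parallelize stores each slice of the input collection inside the task's partition object, so the elements ride along with every task on every action. Ranges are special-cased; their slices are themselves Ranges (just start / end / step), which is why the task size stays near-constant however long the range is, while an ArrayBuffer's slices carry the actual elements. When the data can be generated rather than shipped, one way to keep tasks small is to parallelize only the Range and build the elements on the executors:

    // Ships only (start, end, step) in each task; the boxed Integers are
    // created executor-side instead of being serialized from the driver.
    val distData = sc.parallelize(1 to 1000000)
      .map(i => Integer.valueOf(i)) // illustrative per-element construction
    distData.count()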
Re: How to trace/debug serialization?
From what I've observed, there are no debug logs while serialization takes place. You can look at the source code if you want; the TaskSetManager class has some functions for serialization.
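If the goal is simply to see what a closure or key object serializes to, a plain JVM-level check on the driver works without any Spark hooks (this is generic java.io code, not a Spark API): it reports the serialized size, and it throws a NotSerializableException naming the offending class when something in the object graph can't be serialized.

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    // Serialize an object the way Java serialization would, to inspect
    // its size or surface the exact NotSerializableException.
    def serializedSize(obj: AnyRef): Int = {
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(bytes)
      out.writeObject(obj) // fails here if a captured field isn't serializable
      out.close()
      bytes.size()
    }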
Task Size Increases when using loops
Hi, I'm new to Spark and am facing a peculiar problem. I'm writing a simple Java driver program where I create a key / value data structure and collect it once created. The problem I'm facing is that when I increase the iteration count of a for loop which creates the ArrayList of Long values that I put into the key / value data structure and save in Spark as a Java collection, the serialized size of the tasks also increases proportionately. E.g.:
for loop count: 10, task size: 1120 bytes
for loop count: 100, task size: 33402 bytes
for loop count: 1000, task size: 453434 bytes
etc. I'm not able to understand why the task size increases. I tried to run the same example via the Spark shell, and I noticed that the task size remains the same, irrespective of the loop iteration count. Code:

    @Override
    public void execute() {
        List<Long> numbers = new ArrayList<Long>();
        JavaRDD<Long> distData = null;
        JavaPairRDD<String, Long> mapOfKeys = null;
        JavaRDD<String> keysRDD = null;

        class ByKeyImpl implements Function<Long, String>, Serializable {
            private static final long serialVersionUID = 5749098182016143296L;

            public String call(Long paramT1) throws Exception {
                StringBuilder builder = new StringBuilder();
                builder.append(paramT1).append(',').append(paramT1 + 1);
                return builder.toString();
            }
        }

        System.out.println("** STARTING BENCHMARK EXAMPLE ... *");
        while (true) {
            System.out.println("** DO YOU WANT TO CONTINUE ? (YES/NO) *");
            BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
            try {
                String continueString = reader.readLine();
                if ("yes".equalsIgnoreCase(continueString)) {
                    if (numbers.size() == 0) {
                        // list not populated yet
                        for (long i = 0; i < num; i++) {
                            numbers.add(i);
                        }
                    }
                    // at this time numbers has long values in it;
                    // check whether the RDD has already been created
                    if (distData == null) {
                        System.out.println("NEW RDD CREATED.");
                        if (numPartitions > 0) {
                            distData = sc.parallelize(numbers, numPartitions);
                        } else {
                            distData = sc.parallelize(numbers);
                        }
                    }
                    // at this time the RDD is already present or newly created;
                    // check if the map is null or not
                    if (mapOfKeys == null) {
                        mapOfKeys = distData.keyBy(new ByKeyImpl());
                        keysRDD = mapOfKeys.keys();
                        keysRDD.persist(StorageLevel.MEMORY_ONLY());
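A hedged note on the likely cause: sc.parallelize copies the driver-side list into the partition objects that are serialized with every task, so task size grows with the list, and the data is re-shipped on each action because the partitions, not the executors, own it. Checkpointing after the first materialization replaces those partitions with file references, so later actions ship small tasks. A Scala sketch (the checkpoint directory and `data` are illustrative):

    sc.setCheckpointDir("/tmp/spark-checkpoints") // illustrative path

    // data: a large driver-side Seq[Long], as built by the loop above.
    val numbers = sc.parallelize(data)            // the list rides inside every task
    val keys = numbers.keyBy(n => n + "," + (n + 1))
    keys.checkpoint()                             // materialized by the next action
    keys.count()                                  // writes partitions to the checkpoint dir

    // From here on, `keys`'s partitions are file references, so subsequent
    // actions no longer re-serialize the driver-side data into tasks.
    keys.collect()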
Spark Concepts
Hi, I'm pretty new to both Big Data and Spark. I've just started POC work on Spark, and my team and I are evaluating it against other in-memory computing tools such as GridGain, BigMemory, Aerospike and some others, specifically to solve two sets of problems.
1) Data storage: Our current application runs on a single node with a heavy configuration of 24 cores and 350 GB. Our application loads all the data-mart data, inclusive of multiple cubes, into memory, converts it and keeps it in a Trove collection in the form of a key / value map. This is an immutable collection which takes about 15-20 GB of memory space. Our anticipation is that the data will grow 10-15 fold in the next year or so, and we are not very confident of Trove being able to scale to that level.
2) Compute: Ours is a natively analytical application doing predictive analytics with lots of simulations and optimizations of scenarios. At the heart of all this are the Trove collections, over which we run our mathematical algorithms to calculate the end results. In doing so, the memory consumption of the application goes beyond 250-300 GB, because of the many intermediate computed results (collections) which are further broken down to the granular level and then searched in the Trove collection. All this happens on a single node, which obviously starts to perform slowly over time.
Based on the large volume of data incoming in the next year or so, our current architecture will not be able to handle such a massive in-memory data set and such computing demands, hence we are targeting a change of architecture to cluster-based, in-memory distributed computing. We are evaluating all these products along with Apache Spark. We were very excited by Apache Spark from the videos and some online resources, but when it came down to doing hands-on work we faced lots of issues.
1) What are a standalone cluster's limitations? Can I configure a cluster on a single node with multiple processes for worker nodes, executors etc.? Is this supported even though the IP address would be the same?
2) Why so many Java processes? Why are there so many Java processes (worker nodes, executors)? Will the communication between them not slow down performance as a whole?
3) How is parallelism on partitioned data achieved? This one is really important for us to understand, since we are doing our benchmarking on partitioned data. We do not know how to configure partitions in Spark; any help here would be appreciated. We want to partition the data present in cubes, hence we want each cube to be a separate partition.
4) What is the difference between multiple nodes executing jobs and multiple tasks executing jobs? How do these handle partitioning and parallelism?
Help with these questions would be really appreciated, to get a better sense of Apache Spark. Thanks, Nitin
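On questions 1 and 3, as far as I can tell: standalone mode does allow several worker processes on one machine (SPARK_WORKER_INSTANCES, SPARK_WORKER_CORES and SPARK_WORKER_MEMORY in conf/spark-env.sh control this), and partition placement is governed by a Partitioner. A sketch of a custom partitioner giving each cube its own partition, assuming keys carry a cube ID (the key layout here is an assumption, not from the post):

    import org.apache.spark.Partitioner

    // Assumed key shape: (cubeId, memberKey), with cube IDs in 0 until numCubes.
    class CubePartitioner(numCubes: Int) extends Partitioner {
      override def numPartitions: Int = numCubes
      override def getPartition(key: Any): Int = key match {
        case (cubeId: Int, _) => cubeId % numCubes
        case _                => 0
      }
    }

    // cubeData: RDD[((Int, Array[Short]), Double)], assumed layout.
    // val partitioned = cubeData.partitionBy(new CubePartitioner(16)).cache()

Tasks then run one per partition per stage, so with one partition per cube each cube is processed by a single task at a time.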
Re: Spark Concepts
Anybody with good hands-on experience with Spark, please do reply. It would help us a lot!!