Re: Iterable of Strings

2014-09-09 Thread Sean Owen
These questions have been Scala questions, not Spark questions. It's better to look for answers on the internet or on discussion groups devoted to Scala. StackOverflow is good, for example. An array is indexed by integers, not strings. It's not even clear what you intend here. On Tue, Sep 9,

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-09 Thread Sean Owen
This structure is not specific to Hadoop, but in theory works in any JAR file. You can put JARs in JARs and refer to them with Class-Path entries in META-INF/MANIFEST.MF. It works but I have found it can cause trouble with programs that query the JARs on the classpath to find other classes. When

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError GC overhead limit exceeded

2014-09-09 Thread Ankur Dave
At 2014-09-05 12:13:18 +0200, Yifan LI iamyifa...@gmail.com wrote: But how to assign the storage level to a new vertices RDD that is mapped from an existing vertices RDD, e.g. *val newVertexRDD = graph.collectNeighborIds(EdgeDirection.Out).map{case(id:VertexId, a:Array[VertexId]) => (id,
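[A rough sketch of the kind of call being asked about; the storage level is arbitrary and the mapped value is simply passed through:

import org.apache.spark.graphx._
import org.apache.spark.storage.StorageLevel

// minimal sketch, assuming `graph` is an existing Graph[_, _]
val newVertexRDD = graph
  .collectNeighborIds(EdgeDirection.Out)
  .map { case (id: VertexId, a: Array[VertexId]) => (id, a) }
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
]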

Spark streaming: size of DStream

2014-09-09 Thread julyfire
I want to implement the following logic: val stream = getFlumeStream() // a DStream if (size_of_stream > 0) // if the DStream contains some RDD stream.someTransformation stream.count() can figure out the number of RDDs in a DStream, but it returns a DStream[Long] and can't be compared with a

Re: How to profile a spark application

2014-09-09 Thread julyfire
VisualVM is free and is enough in most situations.

Re: Huge matrix

2014-09-09 Thread Debasish Das
Hi Xiangrui, For tall skinny matrices, if I can pass a similarityMeasure to computeGrammian, I could re-use the SVD's computeGrammian for similarity computation as well... Do you recommend using this approach for tall skinny matrices or just use the dimsum's routines ? Right now RowMatrix does

Re: groupBy gives non deterministic results

2014-09-09 Thread Davies Liu
What's the type of the key? If the hash of the key is different across slaves, then you could get these confusing results. We have seen similar results in Python, because the hash of None is different across machines. Davies On Mon, Sep 8, 2014 at 8:16 AM, redocpot julien19890...@gmail.com wrote:

Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
Hi, I had been using Mahout's Naive Bayes algorithm to classify document data. For a specific train and test set, I was getting accuracy in the range of 86%. When I shifted to Spark's MLlib, the accuracy dropped to the vicinity of 82%. I am using the same version of Lucene and the same logic to generate

RE: Spark streaming: size of DStream

2014-09-09 Thread Shao, Saisai
Hi, Is there any specific scenario that needs to know the number of RDDs in the DStream? To my knowledge, a DStream will generate one RDD in each batchDuration; some old RDDs will be remembered for windowing-like functions, and will be removed when useless. The hashmap generatedRDDs

RE: Spark streaming: size of DStream

2014-09-09 Thread julyfire
Hi Jerry, Thanks for your reply. I use Spark Streaming to receive the Flume stream, and then I need to make a decision in each batchDuration: if the received stream has data, I should do one thing; if there is no data, do the other. I thought count() could give me that measure, but it returns

Re: Spark streaming: size of DStream

2014-09-09 Thread Sean Owen
How about calling foreachRDD, and processing whatever data is in each RDD normally, and also keeping track within the foreachRDD function of whether any RDD had a count() > 0? If not, then you can execute at the end your alternate logic in the case of no data. I don't think you want to operate at
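[A minimal sketch of that pattern, assuming `stream` is the DStream in question; the processing bodies are placeholders:

stream.foreachRDD { rdd =>
  // take(1) checks for emptiness without counting the whole RDD
  if (rdd.take(1).nonEmpty) {
    // normal per-batch processing goes here
  } else {
    // alternate logic for a batch with no data
  }
}
]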

RE: Spark streaming: size of DStream

2014-09-09 Thread Shao, Saisai
Hi, I think every received stream will generate an RDD in each batch duration even if there is no data fed in (an empty RDD will be generated). So you cannot use the number of RDDs to judge whether any data was received. One way is to do this in DStream.foreachRDD(), like a.foreachRDD { r => if

RE: Spark streaming: size of DStream

2014-09-09 Thread julyfire
Thanks all. Yes, I did use foreachRDD; the following is my code: var count = -1L // a global variable in the main object val currentBatch = some_DStream val countDStream = currentBatch.map(o => { count = 0L // reset the count variable in each batch o })

Re: Spark driver application can not connect to Spark-Master

2014-09-09 Thread niranda
Hi, I had the same issue in my Java code while I was trying to connect to a locally hosted spark server (using sbin/start-all.sh etc) using an IDE (IntelliJ). I packaged my app into a jar and used spark-submit (in bin/) and it worked! Hope this helps Rgds

Re: Huge matrix

2014-09-09 Thread Reza Zadeh
Hi Deb, Did you mean to message me instead of Xiangrui? For TS matrices, dimsum with positiveinfinity and computeGramian have the same cost, so you can do either one. For dense matrices with, say, 1m columns this won't be computationally feasible and you'll want to start sampling with dimsum. It
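[For reference, a rough sketch of the Gramian route on a tall-and-skinny RowMatrix; the actual MLlib method name is computeGramianMatrix, and `rows` as an RDD[Vector] is an assumption:

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)           // n columns, with n small enough that an n x n matrix fits in memory
val gram = mat.computeGramianMatrix()   // local n x n Matrix equal to A^T A; similarities can be derived from it
]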

RE: Spark streaming: size of DStream

2014-09-09 Thread julyfire
I'm sorry, there was an error in my code; an update here: var count = -1L // a global variable in the main object val currentBatch = some_DStream val countDStream = currentBatch.map(o => { count = 0L // reset the count variable in each batch o }) countDStream.foreachRDD(rdd => count

Re: groupBy gives non deterministic results

2014-09-09 Thread Ye Xianjin
Can you provide a small sample or test data that reproduces this problem? And what's your env setup, single node or cluster? Sent from my iPhone On September 8, 2014, at 22:29, redocpot julien19890...@gmail.com wrote: Hi, I have a key-value RDD called rdd below. After a groupBy, I tried to count

Re: Spark streaming: size of DStream

2014-09-09 Thread Luis Ángel Vicente Sánchez
If you take into account what streaming means in Spark, your goal doesn't really make sense; you have to assume that your streams are infinite and that you will have to process them until the end of days. Operations on a DStream define what you want to do with each element of each RDD, but Spark

RE: Spark streaming: size of DStream

2014-09-09 Thread Shao, Saisai
I think you should clarify some things in Spark Streaming: 1. The closure in map runs on the remote side, so modifying the count var only takes effect on the remote side; you will always get -1 on the driver side. 2. Some code in the foreachRDD closure is lazily executed in each batch duration, while
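[A small illustration of points 1 and 2; `dstream` and the processing are placeholders:

var count = -1L                       // lives in the driver JVM
dstream.map { o => count = 0L; o }    // this closure runs on executors; only their copy of count is reset
dstream.foreachRDD { rdd =>
  count = rdd.count()                 // rdd.count() is an action, so its result comes back to the driver,
                                      // and this assignment does run on the driver in each batch
}
]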

RE: Spark streaming: size of DStream

2014-09-09 Thread julyfire
Yes, I agree with transforming directly on the DStream even if no data is injected in a given batch duration. But my situation is: Spark receives the Flume stream continuously, and I use the updateStateByKey function to collect data for a key across several batches; then I will handle the collected data after

Re: Filter function problem

2014-09-09 Thread Blackeye
To help anyone answer, I can say that I checked the inactiveIDs.filter operation separately, and found that it doesn't return null in any case. In addition, I don't know how to handle (or check) whether an RDD is null. I find the debugging too complicated to pinpoint the error. Any ideas how to

Re: Querying a parquet file in s3 with an ec2 install

2014-09-09 Thread Jim Carroll
My apologies to the list. I replied to Manu's question and it went directly to him rather than the list. In case anyone else has this issue here is my reply and Manu's reply to me. This also answers Ian's question. --- Hi Manu, The dataset is 7.5 million

Re: Querying a parquet file in s3 with an ec2 install

2014-09-09 Thread Jim Carroll
The reason I think it's the number of files is that I believe all, or a large part, of those files are read when you run sqlContext.parquetFile(), and the time that takes on S3 is long, so something internally is timing out. I'll create the parquet files with Drill

Re: Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
Hi, I tried running the classification program on the famous newsgroup data. This had an even more drastic effect on the accuracy, as it dropped from ~82% in Mahout to ~72% in Spark MLlib. Please help me in this regard as I have to use Spark in a production system very soon and this is a blocker

spark functionality similar to hadoop's RecordWriter close method

2014-09-09 Thread robertberta
I want to call a function for batches of elements from an RDD: val javaClass: org.apache.spark.api.java.function.Function[Seq[String], Unit] = new JavaClass() rdd.mapPartitions(_.grouped(5)).foreach(javaClass) 1. This worked fine in Spark 0.9.1; when we upgraded to Spark 1.0.2, Function changed

Re: spark functionality similar to hadoop's RecordWriter close method

2014-09-09 Thread Sean Owen
You're mixing the Java and Scala APIs here. Your call to foreach() is expecting a Scala function and you're giving it a Java Function. Ideally you just use the Scala API, of course. Before explaining how to actually use a Java function here, maybe clarify that you have to do it and can't use Scala
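[If the Scala API alone is enough, a sketch like the following avoids the Java Function entirely; `rdd` is assumed to be an RDD[String] and the per-batch work is a placeholder:

rdd.mapPartitions(_.grouped(5)).foreach { batch =>
  // `batch` is a Seq[String] of up to 5 elements taken from one partition
  println(batch.mkString(","))
}
]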

Re: groupBy gives non deterministic results

2014-09-09 Thread redocpot
Thank you for your replies. More details here: The program is executed in local mode (single node). Default env params are used. The test code and the result are in this gist: https://gist.github.com/coderh/0147467f0b185462048c Here are the first 10 lines of the data: 3 fields per row, the delimiter

Re: Huge matrix

2014-09-09 Thread Debasish Das
Cool...can I add loadRowMatrix in your PR ? Thanks. Deb On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote: Hi Deb, Did you mean to message me instead of Xiangrui? For TS matrices, dimsum with positiveinfinity and computeGramian have the same cost, so you can do either

Re: Accuracy hit in classification with Spark

2014-09-09 Thread Xiangrui Meng
If you are using the Mahout's Multinomial Naive Bayes, it should be the same as MLlib's. I tried MLlib with news20.scale downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html and the test accuracy is 82.4%. -Xiangrui On Tue, Sep 9, 2014 at 4:58 AM, jatinpreet

Re: Huge matrix

2014-09-09 Thread Reza Zadeh
Better to do it in a PR of your own, it's not sufficiently related to dimsum On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das debasish.da...@gmail.com wrote: Cool...can I add loadRowMatrix in your PR ? Thanks. Deb On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote: Hi Deb,

Re: Querying a parquet file in s3 with an ec2 install

2014-09-09 Thread Jim Carroll
Okay, This seems to be either a code version issue or a communication issue. It works if I execute the spark shell from the master node. It doesn't work if I run it from my laptop and connect to the master node. I had opened the ports for the WebUI (8080) and the cluster manager (7077) for the

PySpark on Yarn - how group by data properly

2014-09-09 Thread Oleg Ruchovets
Hi, I come from a map/reduce background and am trying to do a quite trivial thing: I have a lot of files (on hdfs) in the format: 1,2,3  2,3,5  1,3,5  2,3,4  2,5,1 (one triple per line). I actually need to group by key (the first column): key values 1 -- (2,3),(3,5) 2 --

RDD memory questions

2014-09-09 Thread Boxian Dong
I am currently working on a machine learning project, which requires the RDDs' content to be (mostly partially) updated during each iteration. Because the program will be converted directly from traditional Python object-oriented code, the content of the RDD will be modified in the mapping function.

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-09 Thread Marcelo Vanzin
Yes, that's how file: URLs are interpreted everywhere in Spark. (It's also explained in the link to the docs I posted earlier.) The second interpretation below is local: URLs in Spark, but that doesn't work with Yarn on Spark 1.0 (so it won't work with CDH 5.1 and older either). On Mon, Sep 8,

Re: Filter function problem

2014-09-09 Thread Burak Yavuz
Hi, val test = persons.value.map{tuple => (tuple._1, tuple._2.filter{event => *inactiveIDs.filter(event2 => event2._1 == tuple._1).count() != 0*})} Your problem is right between the asterisks. You can't perform an RDD operation inside an RDD operation, because RDDs can't be serialized.
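[One common workaround, assuming inactiveIDs is small enough to collect to the driver and that `sc` is the SparkContext; this is a sketch reusing the names from the snippet above:

// collect the keys of the small RDD once, then ship them to executors as a broadcast value
val inactiveKeys = sc.broadcast(inactiveIDs.map(_._1).collect().toSet)

val test = persons.value.map { tuple =>
  // same condition as before: keep the events only when the person's id appears among the inactive ids
  (tuple._1, tuple._2.filter(_ => inactiveKeys.value.contains(tuple._1)))
}
]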

Re: groupBy gives non deterministic results

2014-09-09 Thread Davies Liu
Which version of Spark are you using? This bug had been fixed in 0.9.2, 1.0.2 and 1.1, could you upgrade to one of these versions to verify it? Davies On Tue, Sep 9, 2014 at 7:03 AM, redocpot julien19890...@gmail.com wrote: Thank you for your replies. More details here: The prog is

Re: RDD memory questions

2014-09-09 Thread Davies Liu
On Tue, Sep 9, 2014 at 10:07 AM, Boxian Dong box...@indoo.rs wrote: I currently working on a machine learning project, which require the RDDs' content to be (mostly partially) updated during each iteration. Because the program will be converted directly from traditional python object-oriented

Re: Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
Thanks for the information Xiangrui. I am using the following example to classify documents. http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ I am not sure if this is the best way to convert textual data into vectors. Can you please confirm
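[For comparison, a hedged sketch of the HashingTF/IDF route available in MLlib as of Spark 1.1; this is not necessarily what the linked example does, and `docs` as an RDD[(Double, Seq[String])] of label/token pairs is an assumption:

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.NaiveBayes

val hashingTF = new HashingTF()                               // default 2^20 feature dimensions
val tf = docs.map { case (_, tokens) => hashingTF.transform(tokens) }
tf.cache()
val tfidf = new IDF().fit(tf).transform(tf)                   // IDF is fit and applied over the whole corpus
val training = docs.map(_._1).zip(tfidf).map { case (label, v) => LabeledPoint(label, v) }
val model = NaiveBayes.train(training, lambda = 1.0)
]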

Re: Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
I have also run some tests on the other algorithms available in MLlib but got dismal accuracy. Is the method of creating the LabeledPoint RDD different for other algorithms such as LinearRegressionWithSGD? Any help is appreciated. - Novice Big Data Programmer

Re: PySpark on Yarn - how group by data properly

2014-09-09 Thread Davies Liu
On Tue, Sep 9, 2014 at 9:56 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi , I came from map/reduce background and try to do quite trivial thing: I have a lot of files ( on hdfs ) - format is : 1 , 2 , 3 2 , 3 , 5 1 , 3, 5 2, 3 , 4 2 , 5, 1 I am actually need

Re: streaming: code to simulate a network socket data source

2014-09-09 Thread danilopds
Hello Diana, How can I include this implementation in my code, in terms of starting this task together with the NetworkWordCount? In my case, I have a directory with several files. I include this line: StreamingDataGenerator.streamingGenerator(NetPort, BytesSecond, DirFiles) But the program

Re: Cannot run SimpleApp as regular Java app

2014-09-09 Thread Yana Kadiyska
spark-submit is a script which calls the spark-class script. Can you output the command that spark-class runs (say, by putting set -x before the very last line)? You should see the java command that is being run. The scripts do some parameter setting, so it's possible you're missing something. It

Re: Problem in running mosek in spark cluster - java.lang.UnsatisfiedLinkError: no mosekjava7_0 in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1738)

2014-09-09 Thread Yana Kadiyska
If that library has native dependencies you'd need to make sure that the native code is on all boxes and in the path with export SPARK_LIBRARY_PATH=... On Tue, Sep 9, 2014 at 10:17 AM, ayandas84 ayanda...@gmail.com wrote: We have a small apache spark cluster of 6 computers. We are trying to

Re: streaming: code to simulate a network socket data source

2014-09-09 Thread danilopds
I ran this code separately, but the program blocks at this line: val socket = listener.accept() Do you have any suggestions? Thanks

spark on yarn history server + hdfs permissions issue

2014-09-09 Thread Greg Hill
I am running Spark on Yarn with the HDP 2.1 technical preview. I'm having issues getting the spark history server permissions to read the spark event logs from hdfs. Both sides are configured to write/read logs from: hdfs:///apps/spark/events The history server is running as user spark, the

Re: Spark processes not doing on killing corresponding YARN application

2014-09-09 Thread didata
I figured out this issue (in our case), and I'll vent a little in my reply here. =:) Fedora's well-intentioned firewall (firewall-cmd) requires you to open (enable) any ports/services on a host that you need to connect to (including SSH/22, which is enabled by default, of course). So when

spark-streaming Could not compute split exception

2014-09-09 Thread Penny Espinoza
Hey - I have a Spark 1.0.2 job (using spark-streaming-kafka) that runs successfully using master = local[4]. However, when I run it on a Hadoop 2.2 EMR cluster using master yarn-client, it fails after running for about 5 minutes. My main method does something like this: 1. gets streaming

Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-09 Thread Penny Espinoza
I finally seem to have gotten past this issue. Here’s what I did: * rather than using the binary distribution, I built Spark from scratch to eliminate the 4.1 version of org.apache.httpcomponents from the assembly * git clone https://github.com/apache/spark.git * cd spark

Distributed Deep Learning Workshop with Scala, Akka, and Spark

2014-09-09 Thread Alexy Khrabrov
On September 25-26, SF Scala teams up with Adam Gibson, the creator of deeplearning4j.org, to teach the first ever Distributed Deep Learning with Scala, Akka, and Spark workshop. Deep Learning is enabling breakthrough advances in areas such as image recognition and natural language

Spark HiveQL support plan

2014-09-09 Thread XUE, Xiaohui
Hi, On the Spark website, there’s a plan to support HiveQL on top of Spark SQL and also to support JDBC/ODBC. I would like to know whether, in this “HiveQL” supported by Spark (or Spark SQL accessible through JDBC/ODBC), there is a plan to add extensions to leverage different Spark features like

Re: spark-streaming Could not compute split exception

2014-09-09 Thread Marcelo Vanzin
This has all the symptoms of Yarn killing your executors due to them exceeding their memory limits. Could you check your RM/NM logs to see if that's the case? (The error was because of an executor at domU-12-31-39-0B-F1-D1.compute-1.internal, so you can check that NM's log file.) If that's the

spark.cleaner.ttl and spark.streaming.unpersist

2014-09-09 Thread Luis Ángel Vicente Sánchez
The executors of my Spark Streaming application are being killed due to memory issues. The memory consumption is quite high on startup because it is the first run and there are quite a few events on the Kafka queues that are consumed at a rate of 100K events per sec. I wonder if it's recommended to
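[For reference, a sketch of how those two settings are typically set programmatically; the values here are placeholders, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("my-streaming-app")
  .set("spark.streaming.unpersist", "true")  // let Spark unpersist RDDs the streaming job no longer needs
  .set("spark.cleaner.ttl", "3600")          // periodic cleanup of old metadata/RDDs, in seconds
]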

Yarn Driver OOME (Java heap space) when executors request map output locations

2014-09-09 Thread jbeynon
I'm running on Yarn with relatively small instances with 4 GB of memory. I'm not caching any data, but when the map stage ends and shuffling begins, all of the executors request the map output locations at the same time, which seems to kill the driver when the number of executors is turned up. For

Re: Yarn Driver OOME (Java heap space) when executors request map output locations

2014-09-09 Thread Marcelo Vanzin
Hi, Yes, this is a problem, and I'm not aware of any simple workarounds (or complex one for that matter). There are people working to fix this, you can follow progress here: https://issues.apache.org/jira/browse/SPARK-1239 On Tue, Sep 9, 2014 at 2:54 PM, jbeynon jbey...@gmail.com wrote: I'm

Re: Yarn Driver OOME (Java heap space) when executors request map output locations

2014-09-09 Thread jbeynon
Thanks Marcelo, that looks like the same thing. I'll follow the Jira ticket for updates.

Re: spark-streaming Could not compute split exception

2014-09-09 Thread Penny Espinoza
The node manager log looks like this - not exactly sure what this means, but the container messages seem to indicate there is still plenty of memory. 2014-09-09 21:47:00,718 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of

Re: Yarn Driver OOME (Java heap space) when executors request map output locations

2014-09-09 Thread Kostas Sakellis
Hey, If you are interested in more details there is also a thread about this issue here: http://apache-spark-developers-list.1001551.n3.nabble.com/Eliminate-copy-while-sending-data-any-Akka-experts-here-td7127.html Kostas On Tue, Sep 9, 2014 at 3:01 PM, jbeynon jbey...@gmail.com wrote: Thanks

Re: clarification for some spark on yarn configuration options

2014-09-09 Thread Andrew Or
Hi Greg, SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The equivalent spark.executor.instances is just another way to set the same thing in your spark-defaults.conf. Maybe this should be documented. :) spark.yarn.executor.memoryOverhead is just an additional margin

Spark caching questions

2014-09-09 Thread Vladimir Rodionov
Hi, users 1. Disk-based cache eviction policy? The same LRU? 2. What is the scope of a cached RDD? Does it survive the application? What happens if I run the Java app next time? Will the RDD be created or read from cache? If the answer is YES, then ... 3. Is there any way to invalidate a cached RDD
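[A small sketch of the explicit caching calls involved in questions 2 and 3; the storage level is an arbitrary example and `sc` is the SparkContext, and this does not answer the eviction-policy question:

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs:///some/path")
val cached = data.map(_.toUpperCase).persist(StorageLevel.MEMORY_AND_DISK)
cached.count()      // the first action materializes the cache
cached.count()      // later actions within the same application reuse it
cached.unpersist()  // explicit invalidation; a cached RDD does not outlive its SparkContext
]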

Deregistered receiver for stream 0: Stopped by driver

2014-09-09 Thread Sing Yip
When I stop the Spark StreamingContext by calling stop(), I always get the following error: ERROR Deregistered receiver for stream 0: Stopped by driver class=org.apache.spark.streaming.scheduler.ReceiverTracker WARN Stopped executor without error

Re: Crawler and Scraper with different priorities

2014-09-09 Thread Peng Cheng
Hi Sandeep, would you be interested in joining my open source project? https://github.com/tribbloid/spookystuff IMHO Spark is indeed not for general-purpose crawling, where the distributed jobs are highly homogeneous, but it is good enough for directional scraping, which involves heterogeneous input and

Spark + AccumuloInputFormat

2014-09-09 Thread Russ Weeks
Hi, I'm trying to execute Spark SQL queries on top of the AccumuloInputFormat. Not sure if I should be asking on the Spark list or the Accumulo list, but I'll try here. The problem is that the workload to process SQL queries doesn't seem to be distributed across my cluster very well. My Spark

Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-09 Thread alexandria1101
Hi, I want to use the Spark SQL thrift server in my application and make sure everything is loading and working. I built Spark 1.1 SNAPSHOT and ran the thrift server using ./sbin/start-thriftserver.sh. In my application I load tables into SchemaRDDs and I expect that the thrift server should pick

how to run python examples in spark 1.1?

2014-09-09 Thread freedafeng
I'm mostly interested in the hbase examples in the repo. I saw two examples, hbase_inputformat.py and hbase_outputformat.py, in the 1.1 branch. Can you show me how to run them? The compile step is done. I tried to run the examples, but failed.

Re: Cannot run SimpleApp as regular Java app

2014-09-09 Thread ericacm
Hi Yana - I added the following to spark-class: echo RUNNER: $RUNNER echo CLASSPATH: $CLASSPATH echo JAVA_OPTS: $JAVA_OPTS echo '$@': $@ Here's the output: $ ./spark-submit --class experiments.SimpleApp --master spark://myhost.local:7077

serialization changes -- OOM

2014-09-09 Thread Manku Timma
Has anything changed in the last 30 days w.r.t serialization? I had 620MB of compressed data which used to get serialized-in-spark-memory with 4GB executor memory. Now it fails to get serialized in memory even at 10GB of executor memory. -- Bharath

EOFException when reading from HDFS

2014-09-09 Thread kent
I ran the SimpleApp program from spark tutorial (https://spark.apache.org/docs/1.0.0/quick-start.html), which works fine. However, if I change the file location from local to hdfs, then I get an EOFException. I did some search online which suggests this error is caused by hadoop version

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-09 Thread Du Li
Your tables were registered in the SqlContext, whereas the thrift server works with HiveContext. They seem to be in two different worlds today. On 9/9/14, 5:16 PM, alexandria1101 alexandria.shea...@gmail.com wrote: Hi, I want to use the sparksql thrift server in my application and make sure

Re: Records - Input Byte

2014-09-09 Thread Mayur Rustagi
What do you mean by “control your input”? Are you trying to pace your Spark Streaming by number of words? If so, that is not supported as of now; you can only control the time and consume all files within that time period. -- Regards, Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com

Re: How to set java.library.path in a spark cluster

2014-09-09 Thread qihong
Add something like following to spark-env.sh export LD_LIBRARY_PATH=path of libmosekjava7_0.so:$LD_LIBRARY_PATH (and remove all 5 exports you listed). Then restart all worker nodes, and try again. Good luck! -- View this message in context: