Re: How to set java.library.path in a spark cluster

2014-09-09 Thread qihong
Add something like the following to spark-env.sh: export LD_LIBRARY_PATH=<path to your native libraries>:$LD_LIBRARY_PATH (and remove all 5 exports you listed). Then restart all worker nodes and try again. Good luck! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-java-library-pa

Re: Records - Input Byte

2014-09-09 Thread Mayur Rustagi
What do you mean by "control your input"? Are you trying to pace your Spark Streaming by number of words? If so, that is not supported as of now; you can only control the batch time and consume all files within that time period.  -- Regards, Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-09 Thread Du Li
You need to run mvn install so that the package you built is put into the local maven repo. Then when compiling your own app (with the right dependency specified), the package will be retrieved. On 9/9/14, 8:16 PM, "alexandria1101" wrote: >I think the package does not exist because I need to c

Re: how to choose right DStream batch interval

2014-09-09 Thread qihong
Hi Mayur, Thanks for your response. I did write a simple test that sets up a DStream with 5 batches; the batch duration is 1 second, and the 3rd batch takes an extra 2 seconds. The output of the test shows that the 3rd batch causes a backlog, and Spark Streaming does catch up on the 4th and 5th batch (
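A minimal sketch of where the batch interval under discussion is set (assuming a 1-second duration and an otherwise default configuration):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("batch-interval-test")
    val ssc = new StreamingContext(conf, Seconds(1))  // batch duration = 1 second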

How to set java.library.path in a spark cluster

2014-09-09 Thread ayandas84
Hi, I am working on a 3-machine Cloudera cluster. Whenever I submit a Spark job as a jar file with a native dependency on Mosek, it shows the following error: java.lang.UnsatisfiedLinkError: no mosekjava7_0 in java.library.path How should I set the java.library.path? I printed the environment varia

Re: how to setup steady state stream partitions

2014-09-09 Thread qihong
Thanks for your response. I do have something like:

    val inputDStream = ...
    val keyedDStream = inputDStream.map(...) // use sensorId as key
    val partitionedDStream = keyedDStream.transform(rdd => rdd.partitionBy(new MyPartitioner(...)))
    val stateDStream = partitionedDStream.updateStateByKey[...](ud
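A hedged sketch of the same pattern with the truncated pieces filled in by illustrative stand-ins (a HashPartitioner replaces MyPartitioner; the record format, state type, and update function are assumptions, not the poster's code):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.dstream.DStream

    // inputDStream stands in for the stream above; assume each record is "sensorId,measurement"
    def keyAndUpdate(inputDStream: DStream[String]): DStream[(String, Seq[Double])] = {
      val keyedDStream = inputDStream.map { line =>
        val Array(sensorId, value) = line.split(",")
        (sensorId, value.toDouble)
      }
      val partitioner = new HashPartitioner(100)   // fixed partition count, reused every batch
      keyedDStream.updateStateByKey[Seq[Double]](
        (newValues: Seq[Double], state: Option[Seq[Double]]) =>
          Some(state.getOrElse(Seq.empty[Double]) ++ newValues),
        partitioner)                               // passing the partitioner keeps the partitioning steady
    }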

Re: how to setup steady state stream partitions

2014-09-09 Thread x
Using your own partitioner didn't work? e.g. YourRDD.partitionBy(new HashPartitioner(your number)) xj @ Tokyo On Wed, Sep 10, 2014 at 12:03 PM, qihong wrote: > I'm working on a DStream application. The input are sensors' measurements, > the data format is > > There are 10 thousands sensors,

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-09 Thread alexandria1101
I think the package does not exist because I need to change the pom file:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-assembly_2.10</artifactId>
      <version>1.0.1</version>
      <type>pom</type>
      <scope>provided</scope>
    </dependency>

I changed the version number to 1.1.1, yet that still causes the build error: Failure to find org.apache.spark:spark-assembly_2.10:pom:1.1.1 in http

how to setup steady state stream partitions

2014-09-09 Thread qihong
I'm working on a DStream application. The input is sensors' measurements; the data format is There are 10 thousand sensors, and updateStateByKey is used to maintain the states of the sensors. The code looks like the following: val inputDStream = ... val keyedDStream = inputDStream.map(...) // use s

Re: Spark streaming for synchronous API

2014-09-09 Thread Tobias Pfeiffer
Hi again, On Tue, Sep 9, 2014 at 2:20 PM, Tobias Pfeiffer wrote: > > On Tue, Sep 9, 2014 at 2:02 PM, Ron's Yahoo! wrote: >> >> For example, let’s say there’s a particular topic T1 in a Kafka queue. >> If I have a new set of requests coming from a particular client A, I was >> wondering if I co

Re: Spark EC2 standalone - Utils.fetchFile no such file or directory

2014-09-09 Thread luanjunyi
I've encountered probably the same problem and just figured out the solution. The error was caused because Spark tried to write to the scratch directory but the path didn't exist. It's likely you are running the app on the master node only. In the spark-ec2 setting, the scratch directory for Spar
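A hedged sketch of pointing the scratch directory at a path that actually exists on every node (the property is spark.local.dir; the path below is illustrative -- spark-ec2 installs typically use /mnt/spark):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.local.dir", "/mnt/spark")  // must exist and be writable on every node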

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-09 Thread alexandria1101
Thanks so much! That makes complete sense. However, when I compile I get an error "package org.apache.spark.sql.hive does not exist." Does anyone else have this and any idea why this might be so? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Table-not-f

RE: spark.cleaner.ttl and spark.streaming.unpersist

2014-09-09 Thread Shao, Saisai
Hi Luis, The parameters “spark.cleaner.ttl” and “spark.streaming.unpersist” can both be used to remove useless, timed-out streaming data. The difference is that “spark.cleaner.ttl” is a time-based cleaner: it cleans not only streaming input data but also Spark’s useless metadata; while “spark.stream
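A hedged sketch of setting the two parameters being compared (the values are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.cleaner.ttl", "3600")          // time-based cleanup of old data and metadata, in seconds
      .set("spark.streaming.unpersist", "true")  // let streaming unpersist input RDDs once processed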

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-09 Thread Du Li
Your tables were registered in the SqlContext, whereas the thrift server works with HiveContext. They seem to be in two different worlds today. On 9/9/14, 5:16 PM, "alexandria1101" wrote: >Hi, > >I want to use the sparksql thrift server in my application and make sure >everything is loading an
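A hedged sketch of registering data through a HiveContext so the table lands in the Hive metastore that the thrift server reads (an existing SparkContext sc is assumed; the table and path names are illustrative):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("CREATE TABLE IF NOT EXISTS people (name STRING, age INT)")
    hiveContext.sql("LOAD DATA INPATH '/data/people.txt' INTO TABLE people")
    // a JDBC client of the thrift server can now see and query `people`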

EOFException when reading from HDFS

2014-09-09 Thread kent
I ran the SimpleApp program from spark tutorial (https://spark.apache.org/docs/1.0.0/quick-start.html), which works fine. However, if I change the file location from local to hdfs, then I get an EOFException. I did some search online which suggests this error is caused by hadoop version conflic

serialization changes -- OOM

2014-09-09 Thread Manku Timma
Has anything changed in the last 30 days w.r.t serialization? I had 620MB of compressed data which used to get serialized-in-spark-memory with 4GB executor memory. Now it fails to get serialized in memory even at 10GB of executor memory. -- Bharath

Re: Cannot run SimpleApp as regular Java app

2014-09-09 Thread ericacm
Hi Yana - I added the following to spark-class: echo RUNNER: $RUNNER echo CLASSPATH: $CLASSPATH echo JAVA_OPTS: $JAVA_OPTS echo '$@': $@ Here's the output: $ ./spark-submit --class experiments.SimpleApp --master spark://myhost.local:7077 /IdeaProjects/spark-experiments/target/spark-experiments

how to run python examples in spark 1.1?

2014-09-09 Thread freedafeng
I'm mostly interested in the hbase examples in the repo. I saw two examples hbase_inputformat.py and hbase_outputformat.py in the 1.1 branch. Can you show me how to run them? Compile step is done. I tried to run the examples, but failed. -- View this message in context: http://apache-spark-u

Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-09 Thread alexandria1101
Hi, I want to use the sparksql thrift server in my application and make sure everything is loading and working. I built Spark 1.1 SNAPSHOT and ran the thrift server using ./sbin/start-thrift-server. In my application I load tables into schemaRDDs and I expect that the thrift-server should pick th

Spark + AccumuloInputFormat

2014-09-09 Thread Russ Weeks
Hi, I'm trying to execute Spark SQL queries on top of the AccumuloInputFormat. Not sure if I should be asking on the Spark list or the Accumulo list, but I'll try here. The problem is that the workload to process SQL queries doesn't seem to be distributed across my cluster very well. My Spark SQL

Re: Crawler and Scraper with different priorities

2014-09-09 Thread Peng Cheng
Hi Sandeep, would you be interested in joining my open source project? https://github.com/tribbloid/spookystuff IMHO Spark is indeed not for general-purpose crawling, where the distributed jobs are highly homogeneous. But it is good enough for directional scraping, which involves heterogeneous input and

Deregistered receiver for stream 0: Stopped by driver

2014-09-09 Thread Sing Yip
When I stop Spark Streaming Context by calling stop(), I always get the following error: ERROR Deregistered receiver for stream 0: Stopped by driver class=org.apache.spark.streaming.scheduler.ReceiverTracker WARN Stopped executor without error class=org.apache.spark.streaming.receiver.Receive

Spark caching questions

2014-09-09 Thread Vladimir Rodionov
Hi, users 1. Disk-based cache eviction policy? The same LRU? 2. What is the scope of a cached RDD? Does it survive the application? What happens if I run the Java app next time? Will the RDD be re-created or read from the cache? If the answer is YES, then ... 3. Is there any way to invalidate a cached RDD automat

Re: clarification for some spark on yarn configuration options

2014-09-09 Thread Andrew Or
Hi Greg, SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The equivalent "spark.executor.instances" is just another way to set the same thing in your spark-defaults.conf. Maybe this should be documented. :) "spark.yarn.executor.memoryOverhead" is just an additional margin a
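A hedged sketch of the same settings expressed in a SparkConf (values illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.instances", "4")              // same effect as SPARK_EXECUTOR_INSTANCES
      .set("spark.yarn.executor.memoryOverhead", "512")  // extra MB reserved on top of the executor heap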

Re: spark-streaming "Could not compute split" exception

2014-09-09 Thread Marcelo Vanzin
Your executor is exiting or crashing unexpectedly: On Tue, Sep 9, 2014 at 3:13 PM, Penny Espinoza wrote: > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit > code from container container_1410224367331_0006_01_03 is : 1 > 2014-09-09 21:47:26,345 WARN > org.apache.hadoo

Re: Yarn Driver OOME (Java heap space) when executors request map output locations

2014-09-09 Thread Kostas Sakellis
Hey, If you are interested in more details there is also a thread about this issue here: http://apache-spark-developers-list.1001551.n3.nabble.com/Eliminate-copy-while-sending-data-any-Akka-experts-here-td7127.html Kostas On Tue, Sep 9, 2014 at 3:01 PM, jbeynon wrote: > Thanks Marcelo, that lo

Re: spark-streaming "Could not compute split" exception

2014-09-09 Thread Penny Espinoza
The node manager log looks like this - not exactly sure what this means, but the container messages seem to indicate there is still plenty of memory. 2014-09-09 21:47:00,718 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTre

Re: Yarn Driver OOME (Java heap space) when executors request map output locations

2014-09-09 Thread jbeynon
Thanks Marcelo, that looks like the same thing. I'll follow the Jira ticket for updates. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Yarn-Driver-OOME-Java-heap-space-when-executors-request-map-output-locations-tp13827p13829.html Sent from the Apache Spar

Re: Yarn Driver OOME (Java heap space) when executors request map output locations

2014-09-09 Thread Marcelo Vanzin
Hi, Yes, this is a problem, and I'm not aware of any simple workarounds (or complex one for that matter). There are people working to fix this, you can follow progress here: https://issues.apache.org/jira/browse/SPARK-1239 On Tue, Sep 9, 2014 at 2:54 PM, jbeynon wrote: > I'm running on Yarn with

Yarn Driver OOME (Java heap space) when executors request map output locations

2014-09-09 Thread jbeynon
I'm running on Yarn with relatively small instances with 4gb memory. I'm not caching any data but when the map stage ends and shuffling begins all of the executors request the map output locations at the same time which seems to kill the driver when the number of executors is turned up. For exampl

spark.cleaner.ttl and spark.streaming.unpersist

2014-09-09 Thread Luis Ángel Vicente Sánchez
The executors of my spark streaming application are being killed due to memory issues. The memory consumption is quite high on startup because it is the first run and there are quite a few events on the kafka queues, which are consumed at a rate of 100K events per sec. I wonder if it's recommended to u

Re: spark-streaming "Could not compute split" exception

2014-09-09 Thread Marcelo Vanzin
This has all the symptoms of Yarn killing your executors due to them exceeding their memory limits. Could you check your RM/NM logs to see if that's the case? (The error was because of an executor at domU-12-31-39-0B-F1-D1.compute-1.internal, so you can check that NM's log file.) If that's the ca

Spark HiveQL support plan

2014-09-09 Thread XUE, Xiaohui
Hi, On the Spark website, there’s a plan to support HiveQL on top of Spark SQL and also to support JDBC/ODBC. I would like to know if, in this “HiveQL” supported by Spark (or Spark SQL accessible through JDBC/ODBC), there is a plan to add extensions to leverage different Spark features like machin

Distributed Deep Learning Workshop with Scala, Akka, and Spark

2014-09-09 Thread Alexy Khrabrov
On September 25-26, SF Scala teams up with Adam Gibson, the creator of deeplearning4j.org, to teach the first ever Distributed Deep Learning with Scala, Akka, and Spark workshop. Deep Learning is enabling break-through advances in areas such as image recognition and natural language processing.

Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-09 Thread Penny Espinoza
I finally seem to have gotten past this issue. Here’s what I did: * rather than using the binary distribution, I built Spark from scratch to eliminate the 4.1 version of org.apache.httpcomponents from the assembly * git clone https://github.com/apache/spark.git * cd spark

spark-streaming "Could not compute split" exception

2014-09-09 Thread Penny Espinoza
Hey - I have a Spark 1.0.2 job (using spark-streaming-kafka) that runs successfully using master = local[4]. However, when I run it on a Hadoop 2.2 EMR cluster using master yarn-client, it fails after running for about 5 minutes. My main method does something like this: 1. gets streaming
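A hedged sketch of the kind of spark-streaming-kafka setup being described, with the ZooKeeper host, consumer group, and topic names assumed:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("kafka-streaming-job")
    val ssc = new StreamingContext(conf, Seconds(10))
    val stream = KafkaUtils.createStream(
      ssc,
      "zk-host:2181",        // ZooKeeper quorum (assumed)
      "my-consumer-group",   // consumer group id (assumed)
      Map("my-topic" -> 1))  // topic -> number of receiver threads
    stream.map(_._2).count().print()  // message values only
    ssc.start()
    ssc.awaitTermination()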

Re: Spark processes not dying on killing corresponding YARN application

2014-09-09 Thread didata
I figured out this issue (in our case) ... and I'll vent a little in my reply here... =:) Fedora's well-intentioned firewall (firewall-cmd) requires you to open (enable) any ports/services on a host that you need to connect to (including SSH/22, which is enabled by default, of course). So when launch

spark on yarn history server + hdfs permissions issue

2014-09-09 Thread Greg Hill
I am running Spark on Yarn with the HDP 2.1 technical preview. I'm having issues getting the spark history server permissions to read the spark event logs from hdfs. Both sides are configured to write/read logs from: hdfs:///apps/spark/events The history server is running as user spark, the j

Re: streaming: code to simulate a network socket data source

2014-09-09 Thread danilopds
I ran this code separately, but the program blocks at this statement: val socket = listener.accept() Do you have any suggestions? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/streaming-code-to-simulate-a-network-socket-data-source-tp3431p13817.
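A hedged sketch of one way around the blocking accept(): serve the test data from a background thread so the main program can continue (all names and the data are illustrative):

    import java.io.PrintWriter
    import java.net.ServerSocket

    def serveLines(port: Int, lines: Seq[String]): Unit = {
      val listener = new ServerSocket(port)
      new Thread("socket-data-source") {
        override def run(): Unit = {
          val socket = listener.accept()   // blocks only this helper thread
          val out = new PrintWriter(socket.getOutputStream, true)
          lines.foreach(out.println)       // write one line per record
          socket.close()
          listener.close()
        }
      }.start()
    }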

Re: Problem in running mosek in spark cluster - java.lang.UnsatisfiedLinkError: no mosekjava7_0 in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1738)

2014-09-09 Thread Yana Kadiyska
If that library has native dependencies you'd need to make sure that the native code is on all boxes and in the path with export SPARK_LIBRARY_PATH=... On Tue, Sep 9, 2014 at 10:17 AM, ayandas84 wrote: > We have a small apache spark cluster of 6 computers. We are trying to > solve > a distribut

Re: Cannot run SimpleApp as regular Java app

2014-09-09 Thread Yana Kadiyska
spark-submit is a script which calls spark-class script. Can you output the command that spark-class runs (say, by putting set -x before the very last line?). You should see the java command that is being run. The scripts do some parameter setting so it's possible you're missing something. It seems

Re: streaming: code to simulate a network socket data source

2014-09-09 Thread danilopds
Hello Diana, How can I include this implementation in my code, in terms of starting this task together with the NetworkWordCount? In my case, I have a directory with several files. Then, I include this line: StreamingDataGenerator.streamingGenerator(NetPort, BytesSecond, DirFiles) But the program sta

Re: PySpark on Yarn - how group by data properly

2014-09-09 Thread Davies Liu
On Tue, Sep 9, 2014 at 9:56 AM, Oleg Ruchovets wrote: > Hi , > >I came from map/reduce background and try to do quite trivial thing: > > I have a lot of files ( on hdfs ) - format is : > >1 , 2 , 3 >2 , 3 , 5 >1 , 3, 5 > 2, 3 , 4 > 2 , 5, 1 > > I am actually need to grou

Re: Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
I have also run some tests on the other algorithms available with MLlib but got dismal accuracy. Is the method of creating the LabeledPoint RDD different for other algorithms such as LinearRegressionWithSGD? Any help is appreciated. - Novice Big Data Programmer -- View this message in context:

Re: Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
Thanks for the information Xiangrui. I am using the following example to classify documents. http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ I am not sure if this is the best way to convert textual data into vectors. Can you please confirm
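A hedged sketch of one way to turn tokenized documents into TF-IDF vectors for MLlib's NaiveBayes, using Spark 1.1's own feature transformers (an existing SparkContext sc is assumed; the tiny dataset and feature count are illustrative, not the blog post's pipeline):

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.regression.LabeledPoint

    val docs = sc.parallelize(Seq(                 // (label, tokenized document)
      (0.0, Seq("spark", "streaming", "kafka")),
      (1.0, Seq("hadoop", "mapreduce", "yarn"))))
    val hashingTF = new HashingTF(10000)           // feature-vector size
    val labels = docs.map(_._1)
    val tf = docs.map { case (_, tokens) => hashingTF.transform(tokens) }
    val tfidf = new IDF().fit(tf).transform(tf)
    val training = labels.zip(tfidf).map { case (label, v) => LabeledPoint(label, v) }
    val model = NaiveBayes.train(training, 1.0)    // lambda = 1.0 (additive smoothing)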

Re: RDD memory questions

2014-09-09 Thread Davies Liu
On Tue, Sep 9, 2014 at 10:07 AM, Boxian Dong wrote: > I currently working on a machine learning project, which require the RDDs' > content to be (mostly partially) updated during each iteration. Because the > program will be converted directly from "traditional" python object-oriented > code, the

Re: groupBy gives non deterministic results

2014-09-09 Thread Davies Liu
Which version of Spark are you using? This bug had been fixed in 0.9.2, 1.0.2 and 1.1, could you upgrade to one of these versions to verify it? Davies On Tue, Sep 9, 2014 at 7:03 AM, redocpot wrote: > Thank you for your replies. > > More details here: > > The prog is executed on local mode (sin

Re: Filter function problem

2014-09-09 Thread Burak Yavuz
Hi,

    val test = persons.value
      .map{tuple => (tuple._1, tuple._2
      .filter{event => *inactiveIDs.filter(event2 => event2._1 == tuple._1).count() != 0*})}

Your problem is right between the asterisks. You can't make an RDD operation inside an RDD operation, because RDDs can't be serialized

Re: Filter function problem

2014-09-09 Thread Daniel Siegmann
You should not be broadcasting an RDD. You also should not be passing an RDD in a lambda to another RDD. If you want, you can call RDD.collect and then broadcast those values (of course, you must be able to fit all those values in memory). On Tue, Sep 9, 2014 at 6:34 AM, Blackeye wrote: > In order to
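A hedged sketch of the suggested fix, assuming the persons data is kept as a plain RDD (called personsRDD below, a stand-in name), an existing SparkContext sc, and that only the small set of inactive IDs is collected and broadcast:

    // collect the small RDD's keys on the driver, then broadcast them
    val inactiveIdSet = sc.broadcast(inactiveIDs.map(_._1).collect().toSet)
    val test = personsRDD.map { case (id, events) =>
      (id, events.filter(_ => inactiveIdSet.value.contains(id)))
    }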

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-09 Thread Marcelo Vanzin
Yes, that's how file: URLs are interpreted everywhere in Spark. (It's also explained in the link to the docs I posted earlier.) The second interpretation below is "local:" URLs in Spark, but that doesn't work with Yarn on Spark 1.0 (so it won't work with CDH 5.1 and older either). On Mon, Sep 8,

RDD memory questions

2014-09-09 Thread Boxian Dong
I am currently working on a machine learning project which requires the RDDs' content to be (mostly partially) updated during each iteration. Because the program will be converted directly from "traditional" Python object-oriented code, the content of the RDD will be modified in the mapping function.

PySpark on Yarn - how group by data properly

2014-09-09 Thread Oleg Ruchovets
Hi, I come from a map/reduce background and am trying to do a quite trivial thing. I have a lot of files (on hdfs); the format is:

    1, 2, 3
    2, 3, 5
    1, 3, 5
    2, 3, 4
    2, 5, 1

I actually need to group by key (first column):

    key   values
    1 --> (2,3),(3,5)
    2 --> (3,5),(3

Re: Querying a parquet file in s3 with an ec2 install

2014-09-09 Thread Jim Carroll
Okay, This seems to be either a code version issue or a communication issue. It works if I execute the spark shell from the master node. It doesn't work if I run it from my laptop and connect to the master node. I had opened the ports for the WebUI (8080) and the cluster manager (7077) for the m

Re: Huge matrix

2014-09-09 Thread Reza Zadeh
Better to do it in a PR of your own, it's not sufficiently related to dimsum On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das wrote: > Cool...can I add loadRowMatrix in your PR ? > > Thanks. > Deb > > On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh wrote: > >> Hi Deb, >> >> Did you mean to message me in

Re: Accuracy hit in classification with Spark

2014-09-09 Thread Xiangrui Meng
If you are using the Mahout's Multinomial Naive Bayes, it should be the same as MLlib's. I tried MLlib with news20.scale downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html and the test accuracy is 82.4%. -Xiangrui On Tue, Sep 9, 2014 at 4:58 AM, jatinpreet wrot

Problem in running mosek in spark cluster - java.lang.UnsatisfiedLinkError: no mosekjava7_0 in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1738)

2014-09-09 Thread ayandas84
We have a small Apache Spark cluster of 6 computers. We are trying to solve a distributed problem which requires solving an optimization problem at each machine during a Spark map operation. We decided to use Mosek as the solver, and I obtained an academic license to this end. We observed that mos

Re: Huge matrix

2014-09-09 Thread Debasish Das
Cool...can I add loadRowMatrix in your PR ? Thanks. Deb On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh wrote: > Hi Deb, > > Did you mean to message me instead of Xiangrui? > > For TS matrices, dimsum with positiveinfinity and computeGramian have the > same cost, so you can do either one. For dense

Re: groupBy gives non deterministic results

2014-09-09 Thread redocpot
Thank you for your replies. More details here: The prog is executed on local mode (single node). Default env params are used. The test code and the result are in this gist: https://gist.github.com/coderh/0147467f0b185462048c Here is 10 first lines of the data: 3 fields each row, the delimiter i

Re: spark functionality similar to hadoop's RecordWriter close method

2014-09-09 Thread Sean Owen
You're mixing the Java and Scala APIs here. Your call to foreach() is expecting a Scala function and you're giving it a Java Function. Ideally you just use the Scala API, of course. Before explaining how to actually use a Java function here, maybe clarify that you have to do it and can't use Scala
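A hedged sketch of the workaround when the Java Function really must be called from the Scala API: wrap its call() in a Scala closure (JavaClass and the element type are taken from the question below; rdd is assumed to be an RDD[String]):

    import org.apache.spark.api.java.function.{Function => JFunction}

    val javaFunc: JFunction[Seq[String], Unit] = new JavaClass()
    // foreach() on a Scala RDD expects a Scala function, so adapt the Java one:
    rdd.mapPartitions(_.grouped(5)).foreach(group => javaFunc.call(group))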

spark functionality similar to hadoop's RecordWriter close method

2014-09-09 Thread robertberta
I want to call a function for batches of elements from an RDD:

    val javaClass: org.apache.spark.api.java.function.Function[Seq[String], Unit] = new JavaClass()
    rdd.mapPartitions(_.grouped(5)).foreach(javaClass)

1. This worked fine in Spark 0.9.1; when we upgraded to Spark 1.0.2, Function changed from

Re: Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
Hi, I tried running the classification program on the famous newsgroup data. This had an even more drastic effect on the accuracy, as it dropped from ~82% in Mahout to ~72% in Spark MLlib. Please help me in this regard as I have to use Spark in a production system very soon and this is a blocker

Re: Querying a parquet file in s3 with an ec2 install

2014-09-09 Thread Jim Carroll
> Why I think it's the number of files: I believe that all of those files, or a large part of them, are read when you run sqlContext.parquetFile(), and the time it would take in S3 for that to happen is long, so something internally is timing out. I'll create the parquet files with D

Re: Querying a parquet file in s3 with an ec2 install

2014-09-09 Thread Jim Carroll
My apologies to the list. I replied to Manu's question and it went directly to him rather than the list. In case anyone else has this issue here is my reply and Manu's reply to me. This also answers Ian's question. --- Hi Manu, The dataset is 7.5 million rows

Re: Filter function problem

2014-09-09 Thread Blackeye
In order to help anyone answer, I can say that I checked the inactiveIDs.filter operation separately, and I found that it doesn't return null in any case. In addition, I don't know how to handle (or check) whether an RDD is null. I find the debugging too complicated to pinpoint the error. Any ideas how to

Filter function problem

2014-09-09 Thread Blackeye
I have the following code written in Scala in Spark (inactiveIDs is an RDD[(Int, Seq[String])], persons is a Broadcast[RDD[(Int, Seq[Event])]], and Event is a class that I have created):

    val test = persons.value
      .map{tuple => (tuple._1, tuple._2
      .filter{event => inactiveIDs.filter(event2 => eve

RE: Spark streaming: size of DStream

2014-09-09 Thread julyfire
Yes, I agree to directly transform on the DStream even when there is no data injected in this batch duration. But my situation is: Spark receives the flume stream continuously, and I use the updateStateByKey function to collect data for a key across several batches; then I will handle the collected data after wai

RE: Spark streaming: size of DStream

2014-09-09 Thread Shao, Saisai
I think you should clarify some things in Spark Streaming: 1. The closure in map runs on the remote side, so modifying the count var will only take effect on the remote side; you will always get -1 on the driver side. 2. Some code in the closure in foreachRDD is lazily executed in each batch duration, while the

Re: Spark streaming: size of DStream

2014-09-09 Thread Luis Ángel Vicente Sánchez
If you take into account what streaming means in spark, your goal doesn't really make sense; you have to assume that your streams are infinite and you will have to process them till the end of the days. Operations on a DStream define what you want to do with each element of each RDD, but spark stre

Re: groupBy gives non deterministic results

2014-09-09 Thread Ye Xianjin
Can you provide a small sample or test data that reproduces this problem? And what's your environment setup, single node or cluster? Sent from my iPhone > On Sep 8, 2014, at 22:29, redocpot wrote: > > Hi, > > I have a key-value RDD called rdd below. After a groupBy, I tried to count > rows. > But the resu

RE: Spark streaming: size of DStream

2014-09-09 Thread julyfire
I'm sorry, I had some errors in my code; updated here:

    var count = -1L  // a global variable in the main object
    val currentBatch = some_DStream
    val countDStream = currentBatch.map(o => {
      count = 0L  // reset the count variable in each batch
      o
    })
    countDStream.foreachRDD(rdd => coun

Re: Huge matrix

2014-09-09 Thread Reza Zadeh
Hi Deb, Did you mean to message me instead of Xiangrui? For TS matrices, dimsum with positiveinfinity and computeGramian have the same cost, so you can do either one. For dense matrices with say, 1m columns this won't be computationally feasible and you'll want to start sampling with dimsum. It

Re: Spark driver application can not connect to Spark-Master

2014-09-09 Thread niranda
Hi, I had the same issue in my Java code while I was trying to connect to a locally hosted spark server (using sbin/start-all.sh etc) using an IDE (IntelliJ). I packaged my app into a jar and used spark-submit (in bin/) and it worked! Hope this helps Rgds -- View this message in context:

RE: Spark streaming: size of DStream

2014-09-09 Thread julyfire
Thanks all, yes, I did use foreachRDD; the following is my code:

    var count = -1L  // a global variable in the main object
    val currentBatch = some_DStream
    val countDStream = currentBatch.map(o => {
      count = 0L  // reset the count variable in each batch
      o
    })
    countDStream.foreachRDD

RE: Spark streaming: size of DStream

2014-09-09 Thread Shao, Saisai
Hi, I think every received stream will generate an RDD in each batch duration even if there is no data fed in (an empty RDD will be generated). So you cannot use the number of RDDs to judge whether any data was received. One way is to do this in DStream.foreachRDD(), like a.foreachRDD { r => if
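A hedged sketch of that foreachRDD check, branching per batch on whether any data arrived (the DStream name is illustrative):

    stream.foreachRDD { rdd =>
      if (rdd.count() > 0) {
        // this batch received data: do the "has data" work here
      } else {
        // empty batch: do the alternate work here
      }
    }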

Re: Spark streaming: size of DStream

2014-09-09 Thread Sean Owen
How about calling foreachRDD, and processing whatever data is in each RDD normally, and also keeping track within the foreachRDD function of whether any RDD had a count() > 0? if not, then you can execute at the end your alternate logic in the case of no data. I don't think you want to operate at t

RE: Spark streaming: size of DStream

2014-09-09 Thread julyfire
Hi Jerry, Thanks for your reply. I use Spark Streaming to receive the flume stream, and then I need to make a decision in each batchDuration: if the received stream has data, I should do one thing; if there is no data, do the other thing. I thought count() could give me that measure, but it returns

RE: Spark streaming: size of DStream

2014-09-09 Thread Shao, Saisai
Hi, Is there any specific scenario which needs to know the number of RDDs in the DStream? To my knowledge, a DStream will generate one RDD in each batchDuration; some old RDDs will be remembered for windowing-like functions and will be removed when useless. The hashmap generatedRDDs i

Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
Hi, I had been using Mahout's Naive Bayes algorithm to classify document data. For a specific train and test set, I was getting accuracy in the range of 86%. When I shifted to Spark's MLlib, the accuracy dropped to the vicinity of 82%. I am using the same version of Lucene and the same logic to generate TFIDF

Re: groupBy gives non deterministic results

2014-09-09 Thread Davies Liu
What's the type of the key? If the hash of key is different across slaves, then you could get this confusing results. We had met this similar results in Python, because of hash of None is different across machines. Davies On Mon, Sep 8, 2014 at 8:16 AM, redocpot wrote: > Update: > > Just test w

Re: Huge matrix

2014-09-09 Thread Debasish Das
Hi Xiangrui, For tall skinny matrices, if I can pass a similarityMeasure to computeGrammian, I could re-use the SVD's computeGrammian for similarity computation as well... Do you recommend using this approach for tall skinny matrices or just use the dimsum's routines ? Right now RowMatrix does n