OFF_HEAP storage level

2014-07-03 Thread Ajay Srivastava
Hi, I was checking the different storage levels of an RDD and found OFF_HEAP. Has anybody used this level? If I use this level, where will the data be stored? If not in the heap, does it mean that we can avoid GC? How can I use this level? I did not find anything in the archive regarding this. Can someone also
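
For context: in this Spark release OFF_HEAP is backed by Tachyon, so a Tachyon store must be running; blocks stored there live outside the JVM heap and are not scanned by the garbage collector. A minimal sketch of requesting the level, assuming an existing SparkContext named sc and an illustrative input path:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/input.txt") // placeholder path
    // Request off-heap storage; the blocks are kept outside the JVM heap.
    rdd.persist(StorageLevel.OFF_HEAP)
    rdd.count() // materializes the RDD under the requested level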

Re: No FileSystem for scheme: hdfs

2014-07-03 Thread Akhil Das
Most likely you are missing the hadoop configuration files (present in conf/*.xml). Thanks Best Regards On Fri, Jul 4, 2014 at 7:38 AM, Steven Cox wrote: > They weren't. They are now and the logs look a bit better - like perhaps > some serialization is completing that wasn't before. > > Bu

SparkSQL with Streaming RDD

2014-07-03 Thread Chang Lim
Would appreciate help on: 1. How to convert a streaming RDD into a JavaSchemaRDD 2. How to structure the driver program to do interactive SparkSQL Using Spark 1.0 with Java. I have streaming code that does updateStateByKey resulting in JavaPairDStream. I am using JavaDStream::compute(time) to get Java
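
A common pattern for mixing the two is to apply SQL inside foreachRDD on each batch. A hedged Scala sketch (the poster is on Java, but the Java API mirrors it; the Event schema, the stream type DStream[(String, Long)], and the column names are illustrative):

    import org.apache.spark.sql.SQLContext

    case class Event(key: String, cnt: Long) // illustrative schema

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit RDD[case class] -> SchemaRDD

    stream.foreachRDD { rdd =>
      // Re-register the current batch as a table, then query it.
      rdd.map { case (k, c) => Event(k, c) }.registerAsTable("events")
      sqlContext.sql("SELECT key, cnt FROM events WHERE cnt > 10")
        .collect().foreach(println)
    }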

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-03 Thread Ankur Dave
Oh, I just read your message more carefully and noticed that you're joining a regular RDD with a VertexRDD. In that case I'm not sure why the warning is occurring, but it might be worth caching both operands (graph.vertices and the regular RDD) just to be sure. Ankur

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-03 Thread Ankur Dave
A common reason for the "Joining ... is slow" message is that you're joining VertexRDDs without having cached them first. This will cause Spark to recompute unnecessarily, and as a side effect, the same index will get created twice and GraphX won't be able to do an efficient zip join. For example,
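
A sketch of the caching being recommended, assuming a Graph[Int, _] named graph and an update RDD named updates (the attribute types are illustrative):

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    // Cache both operands so neither is recomputed and the vertex
    // index is built only once.
    val verts: VertexRDD[Int] = graph.vertices.cache()
    val cachedUpdates: RDD[(VertexId, Int)] = updates.cache()

    // innerJoin can then reuse verts' index for an efficient zip join.
    val joined = verts.innerJoin(cachedUpdates) { (id, old, delta) => old + delta }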

RE: No FileSystem for scheme: hdfs

2014-07-03 Thread Steven Cox
They weren't. They are now and the logs look a bit better - like perhaps some serialization is completing that wasn't before. But I still get the same error periodically. Other thoughts? From: Soren Macbeth [so...@yieldbot.com] Sent: Thursday, July 03, 2014 9:54

Re: No FileSystem for scheme: hdfs

2014-07-03 Thread Soren Macbeth
Are the hadoop configuration files on the classpath for your mesos executors? On Thu, Jul 3, 2014 at 6:45 PM, Steven Cox wrote: > ...and a real subject line. > -- > *From:* Steven Cox [s...@renci.org] > *Sent:* Thursday, July 03, 2014 9:21 PM > *To:* user@spark.apa

No FileSystem for scheme: hdfs

2014-07-03 Thread Steven Cox
...and a real subject line. From: Steven Cox [s...@renci.org] Sent: Thursday, July 03, 2014 9:21 PM To: user@spark.apache.org Subject: Folks, I have a program derived from the Kafka streaming wordcount example which works fine standalone. Running on Mesos is no

[no subject]

2014-07-03 Thread Steven Cox
Folks, I have a program derived from the Kafka streaming wordcount example which works fine standalone. Running on Mesos is not working so well. For starters, I get the error below "No FileSystem for scheme: hdfs". I've looked at lots of promising comments on this issue so now I have - * Eve

Re: Kafka - streaming from multiple topics

2014-07-03 Thread Tobias Pfeiffer
Sergey, On Fri, Jul 4, 2014 at 1:06 AM, Sergey Malov wrote: > > On the other hand, under the hood KafkaInputDStream, which is created with > this KafkaUtils call, calls ConsumerConnector.createMessageStream which > returns a Map[String, List[KafkaStream]] keyed by topic. It is, however, not > expo
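
A sketch of the one-stream-per-topic workaround discussed in this thread, which keeps the topic name by tagging each stream before the union (ssc is an existing StreamingContext; the ZooKeeper quorum, group id, and topic names are placeholders):

    import org.apache.spark.streaming.kafka.KafkaUtils

    val topics = Seq("topic1", "topic2", "topic3")
    val perTopic = topics.map { t =>
      // One receiver per topic; the (key, message) pairs from
      // createStream do not carry the topic, so tag them here.
      KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map(t -> 1))
        .map { case (_, msg) => (t, msg) }
    }
    val tagged = ssc.union(perTopic) // DStream[(topic, message)]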

Re: Case class in java

2014-07-03 Thread Kevin Jung
This will load the listed jars when the SparkContext is created. In the case of the REPL, we define and import classes after the SparkContext is created. According to the above-mentioned site, the Executor installs a class loader in the 'addReplClassLoaderIfNeeded' method using the "spark.repl.class.uri" configuration. Then I will try to m

Re: Execution stalls in LogisticRegressionWithSGD

2014-07-03 Thread Xiangrui Meng
The feature dimension is small. You don't need a big akka.frameSize. The default one (10M) should be sufficient. Did you cache the data before calling LRWithSGD? -Xiangrui On Thu, Jul 3, 2014 at 10:02 AM, Bharath Ravi Kumar wrote: > I tried another run after setting the driver memory to 8G (and >
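
The caching being asked about is a one-line change; a minimal sketch, assuming an RDD[LabeledPoint] named points:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // SGD makes many passes over the data; without cache() every
    // iteration recomputes the input from its lineage.
    val training: RDD[LabeledPoint] = points.cache()
    val model = LogisticRegressionWithSGD.train(training, 100) // 100 iterations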

Re: Sample datasets for MLlib and Graphx

2014-07-03 Thread Nick Pentreath
The Kaggle data is not in libsvm format so you'd have to do some transformation. The Criteo and KDD Cup datasets are, if I recall, fairly large. The Criteo ad prediction data is around 2-3GB compressed I think. To my knowledge these are the largest binary classification datasets I've come across

Re: Sample datasets for MLlib and Graphx

2014-07-03 Thread AlexanderRiggers
Nick Pentreath wrote > Take a look at Kaggle competition datasets > - https://www.kaggle.com/competitions I was looking for files in LIBSVM format and never found anything of a larger size on Kaggle. Most competitions I've seen need data processing and feature generation, but maybe I've to take a s

Re: Sample datasets for MLlib and Graphx

2014-07-03 Thread Nick Pentreath
Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions For SVM there are a couple of ad click prediction datasets of pretty large size. For graph stuff, SNAP has large network data: https://snap.stanford.edu/data/ — Sent from Mailbox On Thu, Jul 3, 2014 at 3

Sample datasets for MLlib and Graphx

2014-07-03 Thread AlexanderRiggers
Hello! I want to play around with several different cluster settings and measure performance for MLlib and GraphX, and was wondering if anybody here could hit me up with datasets for these applications from 5GB onwards? I'm mostly interested in SVM and Triangle Count, but would be glad for any he

Re: Running the BroadcastTest.scala with TorrentBroadcastFactory in a standalone cluster

2014-07-03 Thread Mosharaf Chowdhury
Hi Jack, 1. Several previous instances of the "key not valid?" error have been attributed to memory issues, either memory allocated per executor or per task, depending on the context. You can google it to see some examples. 2. I think your case is similar, even though it's happening due to

RE: Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-03 Thread Mohammed Guller
Thanks, guys. It turned out to be a firewall issue. All the worker nodes had iptables enabled, which only allowed connection from worker to the master node on the standard 7070 port. Once I opened up the other ports, it is working now. Mohammed From: Mayur Rustagi [mailto:mayur.rust...@gmail.c

Re: Which version of Hive support Spark & Shark

2014-07-03 Thread Michael Armbrust
Spark SQL is based on Hive 0.12.0. On Thu, Jul 3, 2014 at 2:29 AM, Ravi Prasad wrote: > Hi, > Can anyone please help me to understand which version of Hive supports > Spark and Shark > > -- > -- > Regards, > RAVI PRASAD. T >

Re: LIMIT with offset in SQL queries

2014-07-03 Thread Michael Armbrust
Doing an offset is actually pretty expensive in a distributed query engine, so in many cases it probably makes sense to just collect and then perform the offset as you are doing now. This is unless the offset is very large. Another limitation here is that HiveQL does not support OFFSET. That sai
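
The collect-then-drop approach described above, as a sketch (the table, ordering column, and numbers are illustrative):

    // HiveQL has no OFFSET, so over-fetch and drop client-side;
    // only sensible when offset + limit is small.
    val offset = 100
    val limit = 20
    val page = sqlContext
      .sql(s"SELECT * FROM logs ORDER BY ts LIMIT ${offset + limit}")
      .collect()    // brings offset + limit rows to the driver
      .drop(offset) // discard the first `offset` rows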

RE: How to use groupByKey and CqlPagingInputFormat

2014-07-03 Thread Mohammed Guller
Martin, 1) The first map contains the columns in the primary key, which could be a compound primary key containing multiple columns, and the second map contains all the non-key columns. 2) try this fixed code: val navnrevmap = casRdd.map{ case (key, value) => (ByteBufferUtil.s

Re: Anaconda Spark AMI

2014-07-03 Thread Jey Kottalam
Hi Ben, Has the PYSPARK_PYTHON environment variable been set in spark/conf/spark-env.sh to the path of the new python binary? FYI, there's a /root/copy-dirs script that can be handy when updating files on an already-running cluster. You'll want to restart the spark cluster for the changes to take

Spark logging strategy on YARN

2014-07-03 Thread Kostiantyn Kudriavtsev
Hi all, Could you please share your best practices on writing logs in Spark? I’m running it on YARN, so when I check the logs I’m a bit confused… Currently, I’m writing System.err.println to put a message in the log and access it via the YARN history server. But I don’t like this way… I’d like to use l

Re: Spark Streaming Error Help -> ERROR actor.OneForOneStrategy: key not found:

2014-07-03 Thread jschindler
I think I have found my answers, but if anyone has thoughts please share. After testing for a while I think the error doesn't have any effect on the process. I think it is the case that there must be elements left in the window from the last run; otherwise my system is completely whack. Please let me

Anaconda Spark AMI

2014-07-03 Thread Benjamin Zaitlen
Hi All, I'm a dev at Continuum and we are developing a fair amount of tooling around Spark. A few days ago someone expressed interest in numpy+pyspark and Anaconda came up as a reasonable solution. I spent a number of hours yesterday trying to rework the base Spark AMI on EC2 but sadly was defeat

Re: Run spark unit test on Windows 7

2014-07-03 Thread Denny Lee
Thanks! will take a look at this later today. HTH! > On Jul 3, 2014, at 11:09 AM, Kostiantyn Kudriavtsev > wrote: > > Hi Denny, > > just created https://issues.apache.org/jira/browse/SPARK-2356 > >> On Jul 3, 2014, at 7:06 PM, Denny Lee wrote: >> >> Hi Konstantin, >> >> Could you please

Re: Run spark unit test on Windows 7

2014-07-03 Thread Kostiantyn Kudriavtsev
Hi Denny, just created https://issues.apache.org/jira/browse/SPARK-2356 On Jul 3, 2014, at 7:06 PM, Denny Lee wrote: > Hi Konstantin, > > Could you please create a jira item at: > https://issues.apache.org/jira/browse/SPARK/ so this issue can be tracked? > > Thanks, > Denny > > > On July 2

Spark Streaming Error Help -> ERROR actor.OneForOneStrategy: key not found:

2014-07-03 Thread jschindler
I am getting this "ERROR actor.OneForOneStrategy: key not found:" exception when I run my code and I'm not sure where it is looking for a key. My setup is: I send packets to a third party service which then uses a webhook to hit one of our servers, which then logs it using kafka. I am just trying

Re: MLLib : Math on Vector and Matrix

2014-07-03 Thread Dmitriy Lyubimov
On Wed, Jul 2, 2014 at 11:40 PM, Xiangrui Meng wrote: > Hi Dmitriy, > > It is sweet to have the bindings, but it is very easy to downgrade the > performance with them. The BLAS/LAPACK APIs have been there for more > than 20 years and they are still the top choice for high-performance > linear alg

Re: issue with running example code

2014-07-03 Thread Gurvinder Singh
Just to provide more information on this issue. It seems that the SPARK_HOME environment variable is causing the issue. If I unset the variable in the spark-class script and run in local mode, my code runs fine without the exception. But if I run with SPARK_HOME set, I get the exception mentioned below. I c

spark text processing

2014-07-03 Thread M Singh
Hi: Is there a way to find out when Spark has finished processing a text file (both for streaming and non-streaming cases)? Also, after processing, can Spark copy the file to another directory? Thanks

Re: write event logs with YARN

2014-07-03 Thread Andrew Or
Hi Christophe, another Andrew speaking. Your configuration looks fine to me. From the stack trace it seems that we are in fact closing the file system prematurely elsewhere in the system, such that when it tries to write the APPLICATION_COMPLETE file it throws the exception you see. This does loo

reading compress lzo files

2014-07-03 Thread Gurvinder Singh
Hi all, I am trying to read lzo files. It seems spark recognizes that the input file is compressed and loads the decompressor: 14/07/03 18:11:01 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library 14/07/03 18:11:01 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [h
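
For reference, one way to read LZO files explicitly is through the LZO-aware input format; a sketch, assuming the hadoop-lzo jar and native library are available to the executors (the path is a placeholder):

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // LzoTextInputFormat can split indexed .lzo files; plain textFile()
    // would treat each compressed file as a single split.
    val lines = sc
      .newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat]("hdfs:///data/logs.lzo")
      .map(_._2.toString)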

Re: reduceByKey Not Being Called by Spark Streaming

2014-07-03 Thread Dan H.
Hi All, I was able to resolve this matter with a simple fix. It seems that in order to process a reduceByKey and the flatMap operations at the same time, the only way to resolve it was to increase the number of threads to > 1. Since I'm developing on my personal machine for speed, I simply updated
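
The fix amounts to giving the local master at least two threads, since a streaming receiver permanently occupies one; a minimal sketch:

    import org.apache.spark.SparkConf

    // "local[2]": one thread for the receiver, one for processing
    // (reduceByKey, flatMap); plain "local" starves the transformations.
    val conf = new SparkConf().setMaster("local[2]").setAppName("dev")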

Re: Kafka - streaming from multiple topics

2014-07-03 Thread Sergey Malov
That’s an obvious workaround, yes, thank you Tobias. However, I’m prototyping a substitute for a real batch process, where I’d have to create six streams (and possibly more). It could be a bit messy. On the other hand, under the hood KafkaInputDStream, which is created with this KafkaUtils call, cal

Re: Run spark unit test on Windows 7

2014-07-03 Thread Denny Lee
Hi Konstantin, Could you please create a jira item at:  https://issues.apache.org/jira/browse/SPARK/ so this issue can be tracked? Thanks, Denny On July 2, 2014 at 11:45:24 PM, Konstantin Kudryavtsev (kudryavtsev.konstan...@gmail.com) wrote: It sounds really strange... I guess it is a bug, c

Running the BroadcastTest.scala with TorrentBroadcastFactory in a standalone cluster

2014-07-03 Thread jackxucs
Hello, I am running the BroadcastTest example in a standalone cluster using spark-submit. I have 8 host machines and made Host1 the master. Host2 to Host8 act as 7 workers to connect to the master. The connection was fine as I could see all 7 hosts on the master web ui. The BroadcastTest example w

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-03 Thread Eustache DIEMERT
Printing the model shows the intercept is always 0 :( Should I open a bug for that? 2014-07-02 16:11 GMT+02:00 Eustache DIEMERT : > Hi list, > > I'm benchmarking MLlib for a regression task [1] and get strange results. > > Namely, using RidgeRegressionWithSGD it seems the predicted points miss

Re: Error: UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition

2014-07-03 Thread Honey Joshi
On Wed, July 2, 2014 2:00 am, Mayur Rustagi wrote: > two job context cannot share data, are you collecting the data to the > master & then sending it to the other context? > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rustagi

Re: Reading text file vs streaming text files

2014-07-03 Thread Akhil Das
Hi Singh! For this use-case it's better to have a streaming context listening to the directory in hdfs where the files are being dropped; you can set the streaming interval to 15 minutes and let the driver program run continuously, so as soon as new files arrive they are taken for process
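
A sketch of that setup with a 15-minute batch interval, assuming a SparkConf named conf and an illustrative directory:

    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val ssc = new StreamingContext(conf, Minutes(15))
    // Picks up files that appear in the directory after the stream starts.
    val files = ssc.textFileStream("hdfs:///incoming")
    files.foreachRDD { rdd =>
      // replace with the real (non-idempotent) counter updates
      println(s"processing ${rdd.count()} new lines")
    }
    ssc.start()
    ssc.awaitTermination()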

matchError:null in ALS.train

2014-07-03 Thread Honey Joshi
Hi All, We are using ALS.train to generate a model for predictions. We are using DStream[] to collect the predicted output and then trying to dump it into a text file using these two approaches: dstream.saveAsTextFiles() and dstream.foreachRDD(rdd=>rdd.saveAsTextFile). But both these approaches are givin

Reading text file vs streaming text files

2014-07-03 Thread M Singh
Hi: I am working on a project where a few thousand text files (~20M in size) will be dropped into an hdfs directory every 15 minutes.  Data from each file will be used to update counters in cassandra (a non-idempotent operation).  I was wondering what is the best way to deal with this: * Use text s

Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector

2014-07-03 Thread Wanda Hawk
I have given this a try in a spark-shell and I still get many "Allocation Failure"s On Thursday, July 3, 2014 9:51 AM, Xiangrui Meng wrote: The SparkKMeans is just an example code showing a barebone implementation of k-means. To run k-means on big datasets, please use the KMeans implemented

Re: Enable Parsing Failed or Incompleted jobs on HistoryServer (YARN mode)

2014-07-03 Thread Surendranauth Hiraman
I've had some odd behavior with jobs showing up in the history server in 1.0.0. Failed jobs do show up but it seems they can show up minutes or hours later. I see in the history server logs messages about bad task ids. But then eventually the jobs show up. This might be your situation. Anecdotall

hdfs short circuit

2014-07-03 Thread Jahagirdar, Madhu
Can I enable Spark to use the "dfs.client.read.shortcircuit" property to improve performance and read natively on local nodes instead of going through the hdfs api? The information contained in this message may be confidential and legally protected under applicable law. The message
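
Short-circuit read is an HDFS client setting, so it can be passed through the Hadoop configuration Spark uses; a sketch, assuming the datanodes are also configured for short-circuit reads (the socket path is a placeholder):

    // Both are standard HDFS client properties; the datanode side must
    // enable short-circuit reads and expose the same domain socket.
    sc.hadoopConfiguration.set("dfs.client.read.shortcircuit", "true")
    sc.hadoopConfiguration.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket")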

Re: Case class in java

2014-07-03 Thread Akhil Das
Add all your jars like this and pass it to the SparkContext List *jars* = > Lists.newArrayList("/home/akhld/mobi/localcluster/x/spark-0.9.1-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.2.0.jar", > > "/home/akhld/mobi/localcluster/codes/pig/build/ivy/lib/Pig/twitter4j-core-3.
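
The Scala equivalent, as a sketch with placeholder paths:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("app")
      // These jars are shipped to every executor when the job starts.
      .setJars(Seq("/path/to/app.jar", "/path/to/dependency.jar"))
    val sc = new SparkContext(conf)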

Re: Case class in java

2014-07-03 Thread Kevin Jung
I found a web page for a hint. http://ardoris.wordpress.com/2014/03/30/how-spark-does-class-loading/ I learned SparkIMain has an internal httpserver to publish class objects, but I can't figure out how to use it in java. Any ideas? Thanks, Kevin -- View this message in context: http://apache-spark-user-

Which version of Hive support Spark & Shark

2014-07-03 Thread Ravi Prasad
Hi, Can anyone please help me to understand which version of Hive supports Spark and Shark -- -- Regards, RAVI PRASAD. T

Re: java options for spark-1.0.0

2014-07-03 Thread Wanda Hawk
With spark-1.0.0 this is the "cmdline" from /proc/#pid: (with the export line "export _JAVA_OPTIONS="...") /usr/java/jdk1.8.0_05/bin/java-cp::/home/spark2013/spark-1.0.0/conf:/home/spark2013/spark-1.0.0/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-core-3.2.

Re: write event logs with YARN

2014-07-03 Thread Christophe Préaud
Hi Andrew, This does not work (the application failed); I get the following error when I put 3 slashes in the hdfs scheme: (...) Caused by: java.lang.IllegalArgumentException: Pathname /dc1-ibd-corp-hadoop-01.corp.dc1.kelkoo.net:9000/user/kookel/spark-events/kelkoo.searchkeywordreport-140437468

Case class in java

2014-07-03 Thread Kevin Jung
Hi, I'm trying to convert a scala spark job into java. In the case of scala, I typically use a 'case class' to apply a schema to an RDD. It can be converted into a POJO class in java, but what I really want to do is dynamically create POJO classes like the scala REPL does. For this reason, I import javassist to creat

Re: Shark Vs Spark SQL

2014-07-03 Thread 田毅
add "MASTER=yarn-client" then the JDBC / Thrift server will run on yarn 2014-07-02 16:57 GMT-07:00 田毅 : > hi, Matei > > > Do you know how to run the JDBC / Thrift server on Yarn? > > > I did not find any suggestion in docs. > > > 2014-07-02 16:06 GMT-07:00 Matei Zaharia : > > Spark SQL in Spark

Re: Spark SQL - groupby

2014-07-03 Thread Takuya UESHIN
Hi, You need to import Sum and Count like: import org.apache.spark.sql.catalyst.expressions.{Sum,Count} // or with wildcard _ or if you use current master branch build, you can use sum('colB) instead of Sum('colB). Thanks. 2014-07-03 16:09 GMT+09:00 Subacini B : > Hi, > > Can someone provide
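
With those imports in place, the aggregation might look like this sketch in the SchemaRDD DSL (the column names are illustrative; the Symbol-to-attribute conversion comes from import sqlContext._):

    import org.apache.spark.sql.catalyst.expressions.{Sum, Count}
    import sqlContext._ // implicit Symbol -> attribute conversion and `as`

    // groupBy takes grouping expressions first, then aggregates.
    val totals = schemaRDD.groupBy('colA)(Sum('colB) as 'total, Count('colB) as 'cnt)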

Re: RDD join: composite keys

2014-07-03 Thread Andrew Ash
Hi Sameer, If you set those two IDs to be a Tuple2 in the key of the RDD, then you can join on that tuple. Example: val rdd1: RDD[Tuple3[Int, Int, String]] = ... val rdd2: RDD[Tuple3[Int, Int, String]] = ... val resultRDD = rdd1.map(k => ((k._1, k._2), k._3)).join( rdd2.map(k =>
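
Spelled out in full, the pattern is symmetric; a sketch:

    // Key both RDDs by the (Int, Int) pair, then join on that tuple.
    val pairs1 = rdd1.map { case (a, b, v) => ((a, b), v) }
    val pairs2 = rdd2.map { case (a, b, v) => ((a, b), v) }
    val resultRDD = pairs1.join(pairs2) // RDD[((Int, Int), (String, String))]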

Re: installing spark 1 on hadoop 1

2014-07-03 Thread Akhil Das
If you have downloaded the pre-compiled binary, it will not have an sbt directory inside it. Thanks Best Regards On Thu, Jul 3, 2014 at 12:35 PM, Akhil Das wrote: > Do you have an sbt directory inside your spark directory? > > Thanks > Best Regards > > > On Wed, Jul 2, 2014 at 10:17 PM, Imran Akb

Re: Spark SQL - groupby

2014-07-03 Thread Subacini B
Hi, Can someone provide me pointers for this issue. Thanks Subacini On Wed, Jul 2, 2014 at 3:34 PM, Subacini B wrote: > Hi, > > Below code throws compilation error , "not found: *value Sum*" . Can > someone help me on this. Do i need to add any jars or imports ? even for > Count , same error

Re: installing spark 1 on hadoop 1

2014-07-03 Thread Akhil Das
Do you have an sbt directory inside your spark directory? Thanks Best Regards On Wed, Jul 2, 2014 at 10:17 PM, Imran Akbar wrote: > Hi, >I'm trying to install spark 1 on my hadoop cluster running on EMR. I > didn't have any problem installing the previous versions, but on this > version I