Re: Spark partial data in memory/and partial in disk

2015-02-27 Thread Akhil Das
You can use persist(StorageLevel.MEMORY_AND_DISK) if you do not have sufficient memory to cache everything. Thanks Best Regards On Fri, Feb 27, 2015 at 7:20 PM, Siddharth Ubale siddharth.ub...@syncoms.com wrote: Hi, How do we manage putting partial data into memory and partial into
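A minimal sketch of that suggestion (the input path and RDD name are illustrative):

    import org.apache.spark.storage.StorageLevel

    // Partitions that fit in memory stay there; the rest spill to local disk
    // instead of being recomputed.
    val events = sc.textFile("hdfs:///data/events")
      .persist(StorageLevel.MEMORY_AND_DISK)
    events.count()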

Some questions after playing a little with the new ml.Pipeline.

2015-02-27 Thread Jaonary Rabarisoa
Dear all, We mainly do large scale computer vision tasks (image classification, retrieval, ...). The pipeline is really great stuff for that. We're trying to reproduce the tutorial given on that topic during the latest Spark Summit (

JLine hangs under Windows8

2015-02-27 Thread Cheng, Hao
Hi, All I was trying to run the Spark SQL CLI on Windows 8 for debugging purposes; however, it seems JLine hangs waiting for input after the ENTER key. I didn't see that under Linux. Has anybody hit the same issue? The call stack is below: main prio=6 tid=0x02548800 nid=0x17cc runnable

Re: Kafka DStream Parallelism

2015-02-27 Thread Sean Owen
The coarsest level at which you can parallelize is the topic. Topics are all but unrelated to each other, so they can be consumed independently. But you can parallelize within the context of a topic too. A Kafka group ID defines a consumer group. One consumer in a group receives each message sent to the topic
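A rough sketch of that pattern, assuming the spark-streaming-kafka artifact is available: several receivers in the same consumer group split a topic's partitions between them, and the resulting streams are unioned (topic name, group ID and intervals are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(10))

    // Three receivers sharing one group ID; Kafka assigns each a subset of the
    // topic's partitions, so consumption is spread over three executors.
    val streams = (1 to 3).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "my-consumer-group", Map("events" -> 1))
    }
    val unified = ssc.union(streams)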

Re: Kafka DStream Parallelism

2015-02-27 Thread Corey Nolet
This was what I was thinking but wanted to verify. Thanks Sean! On Fri, Feb 27, 2015 at 9:56 PM, Sean Owen so...@cloudera.com wrote: The coarsest level at which you can parallelize is topic. Topics are all but unrelated to each other so can be consumed independently. But you can parallelize

RE: JLine hangs under Windows8

2015-02-27 Thread Cheng, Hao
It works after adding the -Djline.terminal=jline.UnsupportedTerminal -Original Message- From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Saturday, February 28, 2015 10:24 AM To: user@spark.apache.org Subject: JLine hangs under Windows8 Hi, All I was trying to run spark sql cli

how to improve performance of spark job with large input to executor?

2015-02-27 Thread ey-chih chow
Hi, I ran a Spark job. Each executor is allocated a chunk of input data. For the executor with a small chunk of input data, the performance is reasonably good. But for the executor with a large chunk of input data, the performance is not good. How can I tune Spark configuration parameters to

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Sean Owen
This seems like a job for userClassPathFirst. Or could be. It's definitely an issue of visibility between where the serializer is and where the user class is. At the top you said Pat that you didn't try this, but why not? On Fri, Feb 27, 2015 at 10:11 PM, Pat Ferrel p...@occamsmachete.com wrote:

Re: java.util.NoSuchElementException: key not found:

2015-02-27 Thread Shixiong Zhu
RDD is not thread-safe. You should not use it in multiple threads. Best Regards, Shixiong Zhu 2015-02-27 23:14 GMT+08:00 rok rokros...@gmail.com: I'm seeing this java.util.NoSuchElementException: key not found: exception pop up sometimes when I run operations on an RDD from multiple threads

Re: Errors in spark

2015-02-27 Thread Yana Kadiyska
I was actually just able to reproduce the issue. I do wonder if this is a bug -- the docs say When not configured by the hive-site.xml, the context automatically creates metastore_db and warehouse in the current directory. But as you can see from the message, warehouse is not in the current

Number of cores per executor on Spark Standalone

2015-02-27 Thread bit1...@163.com
Hi, I know that Spark on YARN has a configuration parameter (executor-cores NUM) to specify the number of cores per executor. How about Spark standalone? I can specify the total cores, but how could I know how many cores each executor will take (presume one node, one executor)?

Re: Speed Benchmark

2015-02-27 Thread Davies Liu
No. It should not be that slow. On my Mac, it took 1.4 minutes to do `rdd.count()` on a 4.3G text file (25M / s / CPU). Could you turn on profiling in pyspark to see what happened in the Python process? spark.python.profile = true On Fri, Feb 27, 2015 at 4:14 PM, Guillaume Guy

Re: Error when running spark-shell

2015-02-27 Thread amoners
Please post your logs; you can get the logs as indicated below: # An error report file with more information is saved as: # /Users/anupamajoshi/spark-1.2.0-bin-hadoop2.4/bin/hs_err_pid4709.log -- View this message in context:

Re: Error when running spark-shell

2015-02-27 Thread Sean Owen
Well, that would just show the JVM bug. This isn't a Spark issue. The JVM crashes, and not because of some native code used by Spark. On Feb 28, 2015 2:04 AM, amoners amon...@lwjendure.com wrote: Please post your logs; you can get the logs as indicated below: # An error report file with more information is

Re: SparkStreaming failing with exception Could not compute split, block input

2015-02-27 Thread Mukesh Jha
I'm streaming data from a Kafka topic using KafkaUtils, doing some computation and writing records to HBase. Storage level is memory-and-disk-ser On 27 Feb 2015 16:20, Akhil Das ak...@sigmoidanalytics.com wrote: You could be hitting this issue https://issues.apache.org/jira/browse/SPARK-4516

Re: Speed Benchmark

2015-02-27 Thread Jason Bell
How many machines are on the cluster? And what is the configuration of those machines (Cores/RAM)? Small cluster is very subjective statement. Guillaume Guy wrote: Dear Spark users: I want to see if anyone has an idea of the performance for a small cluster.

Re: SparkStreaming failing with exception Could not compute split, block input

2015-02-27 Thread Mukesh Jha
Also my job is map-only, so there is no shuffle/reduce phase. On Fri, Feb 27, 2015 at 7:10 PM, Mukesh Jha me.mukesh@gmail.com wrote: I'm streaming data from a Kafka topic using KafkaUtils, doing some computation and writing records to HBase. Storage level is memory-and-disk-ser On 27 Feb

Re: Question about Spark best practice when counting records.

2015-02-27 Thread Paweł Szulc
Currently, if you use accumulators inside actions (like foreach) you have a guarantee that, even if a partition is recalculated, the values will be correct. The same does NOT apply to transformations, where you cannot rely 100% on the values. Pawel Szulc Fri, 27 Feb 2015, 4:54 PM Darin McBeath
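A small sketch of that distinction (the RDD and accumulator are illustrative):

    val rdd = sc.parallelize(1 to 1000)
    val acc = sc.accumulator(0L)

    // Inside an action: the final value is correct even if a partition is re-run.
    rdd.foreach { _ => acc += 1L }

    // Inside a transformation: a recomputed partition may add its contribution
    // again, so the accumulator cannot be relied on for an exact count.
    val mapped = rdd.map { x => acc += 1L; x }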

Re: Question about Spark best practice when counting records.

2015-02-27 Thread Kostas Sakellis
Hey Darin, Record count metrics are coming in Spark 1.3. Can you wait until it is released? Or do you need a solution in older versions of spark. Kostas On Friday, February 27, 2015, Darin McBeath ddmcbe...@yahoo.com.invalid wrote: I have a fairly large Spark job where I'm essentially

Re: Question about Spark best practice when counting records.

2015-02-27 Thread Darin McBeath
Thanks for your quick reply. Yes, that would be fine. I would rather wait and use the optimal approach as opposed to hacking some one-off solution. Darin. From: Kostas Sakellis kos...@cloudera.com To: Darin McBeath ddmcbe...@yahoo.com Cc: User

Re: High CPU usage in Driver

2015-02-27 Thread Paweł Szulc
Thanks for coming back to the list with a response! Fri, 27 Feb 2015, 3:16 PM, Himanish Kushary himan...@gmail.com wrote: Hi, I was able to solve the issue. Putting down the settings that worked for me. 1) It was happening due to the large number of partitions. I *coalesce*'d

Re: Spark excludes fastutil dependencies we need

2015-02-27 Thread Jim Kleckner
Yes, I used both. The discussion on this seems to be at github now: https://github.com/apache/spark/pull/4780 I am using more classes from a package from which spark uses HyperLogLog as well. So we are both including the jar file but Spark is excluding the dependent package that is required.

Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-27 Thread Vijay Saraswat
Available in GML -- http://x10-lang.org/x10-community/applications/global-matrix-library.html We are exploring how to make it available within Spark. Any ideas would be much appreciated. On 2/27/15 7:01 AM, shahab wrote: Hi, I just wonder if there is any Sparse Matrix implementation

Re: High CPU usage in Driver

2015-02-27 Thread Himanish Kushary
Hi, I was able to solve the issue. Putting down the settings that worked for me. 1) It was happening due to the large number of partitions. I *coalesce*'d the RDD as early as possible in my code into far fewer partitions (used coalesce to bring it down from 500K to 10K) 2) Increased the
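A sketch of that first step (the figures follow the message; the RDD and path are illustrative):

    // hugeRdd stands in for the original heavily partitioned RDD.
    val hugeRdd = sc.textFile("hdfs:///data/huge")

    // Collapse ~500K small partitions into ~10K. coalesce avoids a full shuffle
    // and cuts per-partition scheduling and bookkeeping overhead on the driver.
    val compacted = hugeRdd.coalesce(10000)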

Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-27 Thread shahab
Thanks a lot Vijay, let me see how it performs. Best Shahab On Friday, February 27, 2015, Vijay Saraswat vi...@saraswat.org wrote: Available in GML -- http://x10-lang.org/x10-community/applications/global-matrix-library.html We are exploring how to make it available within Spark. Any ideas

Re: Speed Benchmark

2015-02-27 Thread Sean Owen
That's very slow, and there are a lot of possible explanations. The first one that comes to mind is: I assume your YARN and HDFS are on the same machines, but are you running executors on all HDFS nodes when you run this? if not, a lot of these reads could be remote. You have 6 executor slots,

Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-27 Thread shahab
Thanks, But do you know if access to CoordinateMatrix elements is almost as fast as a normal matrix, or whether it has access times similar to an RDD (relatively slow)? I am looking for a sparse matrix data structure with fast access. On Friday, February 27, 2015, Peter Rudenko petro.rude...@gmail.com

Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-27 Thread Ritesh Kumar Singh
Try using Breeze (a Scala linear algebra library). On Fri, Feb 27, 2015 at 5:56 PM, shahab shahab.mok...@gmail.com wrote: Thanks a lot Vijay, let me see how it performs. Best Shahab On Friday, February 27, 2015, Vijay Saraswat vi...@saraswat.org wrote: Available in GML --
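A brief sketch of a Breeze sparse matrix, with MLlib's local sparse vector alongside for comparison (assuming breeze is on the classpath; sizes and entries are illustrative):

    import org.apache.spark.mllib.linalg.Vectors
    import breeze.linalg.CSCMatrix

    // MLlib sparse vector: length 10 with non-zeros at indices 2 and 7.
    val v = Vectors.sparse(10, Array(2, 7), Array(1.0, 3.5))

    // Breeze compressed-sparse-column matrix built incrementally.
    val builder = new CSCMatrix.Builder[Double](rows = 1000, cols = 1000)
    builder.add(0, 1, 2.0)
    builder.add(999, 999, 5.0)
    val m = builder.result()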

Re: Unable to run hive queries inside spark

2015-02-27 Thread sandeep vura
Hi Kundan, Sorry, even I am facing a similar issue today. How did you resolve it? Regards, Sandeep.v On Thu, Feb 26, 2015 at 2:25 AM, Michael Armbrust mich...@databricks.com wrote: It looks like that is getting interpreted as a local path. Are you missing a core-site.xml file

Speed Benchmark

2015-02-27 Thread Guillaume Guy
Dear Spark users: I want to see if anyone has an idea of the performance for a small cluster. Reading from HDFS, what should be the performance of a count() operation on a 10GB RDD with 100M rows using pyspark? I looked into the CPU usage; all 6 are at 100%. Details: - master yarn-client

Spark partial data in memory/and partial in disk

2015-02-27 Thread Siddharth Ubale
Hi, How do we manage putting partial data into memory and partial onto disk when the data resides in a Hive table? We have tried using the available documentation but are unable to go ahead with the above approach; we are only able to cache the entire table or uncache it. Thanks, Siddharth Ubale,

Re: Running spark function on parquet without sql

2015-02-27 Thread tridib
Somehow my posts are not getting accepted, and replies are not visible here. But I got the following reply from Zhan. From Zhan Zhang's reply, yes I still get Parquet's advantage. My next question is, if I operate on SchemaRdd will I get the advantage of Spark SQL's in-memory columnar store

Re: Global sequential access of elements in RDD

2015-02-27 Thread Imran Rashid
Why would you want to use Spark to sequentially process your entire data set? The entire purpose is to let you do distributed processing -- which means letting partitions get processed simultaneously by different cores / nodes. That being said, occasionally in a bigger pipeline with a lot of

RE: group by order by fails

2015-02-27 Thread Tridib Samanta
Thanks Michael! It worked. Somehow my mails are not getting accepted by the spark user mailing list. :( From: mich...@databricks.com Date: Thu, 26 Feb 2015 17:49:43 -0800 Subject: Re: group by order by fails To: tridib.sama...@live.com CC: ak...@sigmoidanalytics.com; user@spark.apache.org Assign

Re: Iterating on RDDs

2015-02-27 Thread Vijayasarathy Kannan
As you suggested, I tried to save the grouped RDD and persisted it in memory before the iterations begin. The performance seems to be much better now. My previous comment that the run times doubled was from a wrong observation. Thanks. On Fri, Feb 27, 2015 at 10:27 AM, Vijayasarathy Kannan

Re: Iterating on RDDs

2015-02-27 Thread Vijayasarathy Kannan
Thanks. I tried persist() on the RDD. The runtimes appear to have doubled now (without persist() it was ~7s per iteration and now it's ~15s). I am running standalone Spark on an 8-core machine. Any thoughts on why the increase in runtime? On Thu, Feb 26, 2015 at 4:27 PM, Imran Rashid

Re: Errors in spark

2015-02-27 Thread sandeep vura
Hi Yana, I have removed hive-site.xml from the spark/conf directory but am still getting the same errors. Any other way to work around this? Regards, Sandeep On Fri, Feb 27, 2015 at 9:38 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: I think you're mixing two things: the docs say When* not *configured

Errors in spark

2015-02-27 Thread sandeep vura
Hi Sparkers, I am using hive version hive 0.13, copied hive-site.xml into spark/conf, and am using the default Derby local metastore. While creating a table in the spark shell I get the following error. Can anyone please take a look and give a solution asap? sqlContext.sql(CREATE TABLE IF NOT EXISTS

java.util.NoSuchElementException: key not found:

2015-02-27 Thread rok
I'm seeing this java.util.NoSuchElementException: key not found: exception pop up sometimes when I run operations on an RDD from multiple threads in a python application. It ends up shutting down the SparkContext so I'm assuming this is a bug -- from what I understand, I should be able to run

Re: Errors in spark

2015-02-27 Thread Yana Kadiyska
I think you're mixing two things: the docs say When* not *configured by the hive-site.xml, the context automatically creates metastore_db and warehouse in the current directory.. AFAIK if you want a local metastore, you don't put hive-site.xml anywhere. You only need the file if you're going to

Failed to parse Hive query

2015-02-27 Thread Anusha Shamanur
Hi, I am trying to do this in spark-shell: val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc) val listTables = hiveCtx.hql("show tables") The second line fails to execute with this message: warning: there were 1 deprecation warning(s); re-run with -deprecation for details

Question about Spark best practice when counting records.

2015-02-27 Thread Darin McBeath
I have a fairly large Spark job where I'm essentially creating quite a few RDDs, do several types of joins using these RDDS resulting in a final RDD which I write back to S3. Along the way, I would like to capture record counts for some of these RDDs. My initial approach was to use the count

Re: Running out of space (when there's no shortage)

2015-02-27 Thread Kelvin Chu
Hi Joe, you might increase spark.yarn.executor.memoryOverhead to see if it fixes the problem. Please take a look of this report: https://issues.apache.org/jira/browse/SPARK-4996 Hope this helps. On Tue, Feb 24, 2015 at 2:05 PM, Yiannis Gkoufas johngou...@gmail.com wrote: No problem, Joe. There

Re: job keeps failing with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1

2015-02-27 Thread Kelvin Chu
Hi Darin, you might increase spark.yarn.executor.memoryOverhead to see if it fixes the problem. Please take a look of this report: https://issues.apache.org/jira/browse/SPARK-4996 On Fri, Feb 27, 2015 at 12:38 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Can you share what error you

Problem getting program to run on 15TB input

2015-02-27 Thread Arun Luthra
My program in pseudocode looks like this: val conf = new SparkConf().setAppName("Test") .set("spark.storage.memoryFraction", "0.2") // default 0.6 .set("spark.shuffle.memoryFraction", "0.12") // default 0.2 .set("spark.shuffle.manager", "SORT") // preferred setting for optimized joins

Re: Race Condition in Streaming Thread

2015-02-27 Thread Tathagata Das
Are you sure the multiple invocations are not from previous runs of the program? TD On Fri, Feb 27, 2015 at 12:16 PM, Nastooh Avessta (navesta) nave...@cisco.com wrote: Hi Under Spark 1.0.0, standalone, client mode am trying to invoke a 3rd party udp traffic generator, from the streaming

Re: Race Condition in Streaming Thread

2015-02-27 Thread Tathagata Das
It wasn't clear from the snippet what's going on. Can you provide the whole Receiver code? TD On Fri, Feb 27, 2015 at 12:37 PM, Nastooh Avessta (navesta) nave...@cisco.com wrote: I am, as I issue killall -9 Prog, prior to testing. Cheers, [image:

What joda-time dependency does spark submit use/need?

2015-02-27 Thread Su She
Hello Everyone, I'm having some issues launching (non-Spark) applications via the spark-submit command. The common error I am getting is c/p below. I am able to submit a Spark Streaming/Kafka application, but can't start a DynamoDB Java app. The common error is related to joda-time. 1) I

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
I understand that I need to supply Guava to Spark. The HashBiMap is created in the client and broadcast to the workers. So it is needed in both. To achieve this there is a deps.jar with Guava (and Scopt but that is only for the client). Scopt is found so I know the jar is fine for the client.

Race Condition in Streaming Thread

2015-02-27 Thread Nastooh Avessta (navesta)
Hi Under Spark 1.0.0, standalone, client mode am trying to invoke a 3rd party udp traffic generator, from the streaming thread. The excerpt is as follows: ... do{ try { p = Runtime.getRuntime().exec(Prog ); socket.receive(packet);

RE: Race Condition in Streaming Thread

2015-02-27 Thread Nastooh Avessta (navesta)
Thank you for your time and effort. Here is the code: --- public final class Multinode extends ReceiverOutput { String host = null; int portRx = -1; int portTx = -1; private final Semaphore sem = new

How to debug a Hung task

2015-02-27 Thread Manas Kar
Hi, I have a spark application that hangs on just one task (the rest of the 200-300 tasks complete in a reasonable time). I can see in the thread dump which function gets stuck; however, I don't have a clue as to what value is causing that behaviour. Also, logging the inputs before the function is

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Marcelo Vanzin
Ah, I see. That makes a lot of sense now. You might be running into some weird class loader visibility issue. I've seen some bugs in jira about this in the past; maybe you're hitting one of them. Until I have some time to investigate (or, if you're curious, feel free to scavenge jira), a

RE: Race Condition in Streaming Thread

2015-02-27 Thread Nastooh Avessta (navesta)
I am, as I issue killall -9 Prog, prior to testing. Cheers, Nastooh Avessta ENGINEER.SOFTWARE ENGINEERING nave...@cisco.com Phone: +1 604 647 1527 Cisco Systems Limited 595 Burrard Street, Suite 2123 Three Bentall Centre, PO

Re: Problem getting program to run on 15TB input

2015-02-27 Thread Burak Yavuz
Hi, Not sure if it can help, but `StorageLevel.MEMORY_AND_DISK_SER` generates many small objects that lead to very long GC times, causing the executor lost, heartbeat not received, and GC overhead limit exceeded messages. Could you try using `StorageLevel.MEMORY_AND_DISK` instead? You can also

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Erlend Hamnaberg
Hi. I have had a similar issue. I had to pull the JavaSerializer source into my own project, just so I got the classloading of this class under control. This must be a class loader issue with Spark. -E On Fri, Feb 27, 2015 at 8:52 PM, Pat Ferrel p...@occamsmachete.com wrote: I understand

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Marcelo Vanzin
On Fri, Feb 27, 2015 at 1:30 PM, Pat Ferrel p...@occamsmachete.com wrote: @Marcelo do you mean by modifying spark.executor.extraClassPath on all workers, that didn’t seem to work? That's an app configuration, not a worker configuration, so if you're trying to set it on the worker configuration

Spark SQL Converting RDD to SchemaRDD without hardcoding a case class in scala

2015-02-27 Thread kpeng1
Hi All, I am currently trying to build out a spark job that would basically convert a csv file into parquet. From what I have seen it looks like Spark SQL is the way to go, and the way I would go about this would be to load the csv file into an RDD and convert it into a SchemaRDD by injecting in

Re: Speed Benchmark

2015-02-27 Thread Guillaume Guy
Hi Jason: Thanks for your feedback. Besides the information I mentioned above, there are 3 machines in the cluster. *1st one*: Driver + a bunch of Hadoop services. 32GB of RAM, 8 cores (2 used). *2nd + 3rd:* 16GB of RAM, 4 cores (2 used each). I hope this helps clarify. Thx. GG Best,

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
I don't use spark-submit; I have a standalone app. So I guess you want me to add that key/value to the conf in my code and make sure it exists on the workers. On Feb 27, 2015, at 1:47 PM, Marcelo Vanzin van...@cloudera.com wrote: On Fri, Feb 27, 2015 at 1:42 PM, Pat Ferrel p...@occamsmachete.com
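A sketch of what that looks like in a standalone app (the jar path is illustrative and has to exist at the same location on every worker):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyApp")
      // Prepended to each executor's classpath, so the bundled Guava is found there too.
      .set("spark.executor.extraClassPath", "/opt/deps/deps.jar")
    val sc = new SparkContext(conf)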

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
@Erlend hah, we were trying to merge your PR and ran into this—small world. You actually compile the JavaSerializer source in your project? @Marcelo do you mean by modifying spark.executor.extraClassPath on all workers, that didn’t seem to work? On Feb 27, 2015, at 1:23 PM, Erlend Hamnaberg

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Marcelo Vanzin
On Fri, Feb 27, 2015 at 1:42 PM, Pat Ferrel p...@occamsmachete.com wrote: I changed in the spark master conf, which is also the only worker. I added a path to the jar that has guava in it. Still can’t find the class. Sorry, I'm still confused about what config you're changing. I'm suggesting

Running hive query from spark

2015-02-27 Thread Anusha Shamanur
Hi, I am trying to do this in spark-shell: val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc) val listTables = hiveCtx.hql("show tables") The second line fails to execute with this message: warning: there were 1 deprecation warning(s); re-run with -deprecation for details

Re: Speed Benchmark

2015-02-27 Thread Guillaume Guy
Hi Sean: Thanks for your feedback. Scala is much faster. The count is performed in ~1 minute (vs 17 min). I would expect Scala to be 2-5X faster, but this gap seems to be more than that. Is that also your conclusion? Thanks. Best, Guillaume Guy * +1 919 - 972 - 8750* On Fri, Feb 27, 2015 at

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
Thanks! that worked. On Feb 27, 2015, at 1:50 PM, Pat Ferrel p...@occamsmachete.com wrote: I don’t use spark-submit I have a standalone app. So I guess you want me to add that key/value to the conf in my code and make sure it exists on workers. On Feb 27, 2015, at 1:47 PM, Marcelo Vanzin

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
I’ll try to find a Jira for it. I hope a fix is in 1.3 On Feb 27, 2015, at 1:59 PM, Pat Ferrel p...@occamsmachete.com wrote: Thanks! that worked. On Feb 27, 2015, at 1:50 PM, Pat Ferrel p...@occamsmachete.com wrote: I don’t use spark-submit I have a standalone app. So I guess you want me to

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
I changed it in the spark master conf, which is also the only worker. I added a path to the jar that has guava in it. Still can't find the class. Trying Erlend's idea next. On Feb 27, 2015, at 1:35 PM, Marcelo Vanzin van...@cloudera.com wrote: On Fri, Feb 27, 2015 at 1:30 PM, Pat Ferrel

Re: Spark SQL Converting RDD to SchemaRDD without hardcoding a case class in scala

2015-02-27 Thread Michael Armbrust
http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema On Fri, Feb 27, 2015 at 1:39 PM, kpeng1 kpe...@gmail.com wrote: Hi All, I am currently trying to build out a spark job that would basically convert a csv file into parquet. From what I have
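The linked approach, sketched for the CSV-to-SchemaRDD case in this thread (Spark 1.2 API; the file path and column names are illustrative):

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", StringType, nullable = true)))

    // Turn each CSV line into a Row matching the schema, then apply it.
    val rowRDD = sc.textFile("people.csv").map(_.split(",")).map(p => Row(p(0), p(1)))
    val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
    peopleSchemaRDD.registerTempTable("people")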

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-27 Thread Patrick Wendell
I think we need to just update the docs; it is a bit unclear right now. At the time, we worded it fairly sternly because we really wanted people to use --jars when we deprecated SPARK_CLASSPATH. But there are other types of deployments where there is a legitimate need to augment the classpath

Re: job keeps failing with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1

2015-02-27 Thread Arush Kharbanda
Can you share what error you are getting when the job fails. On Thu, Feb 26, 2015 at 4:32 AM, Darin McBeath ddmcbe...@yahoo.com.invalid wrote: I'm using Spark 1.2, stand-alone cluster on ec2 I have a cluster of 8 r3.8xlarge machines but limit the job to only 128 cores. I have also tried

spark.default.parallelism

2015-02-27 Thread Deep Pradhan
Hi, I have four single-core machines as slaves in my cluster. I set spark.default.parallelism to 4 and ran SparkTC given in the examples. It took around 26 sec. Then I increased spark.default.parallelism to 8, but the performance deteriorates: the same application takes 32 sec now. I have

Re: How to pass a org.apache.spark.rdd.RDD in a recursive function

2015-02-27 Thread Arush Kharbanda
Passing RDDs around is not a good idea. RDDs are immutable and can't be changed inside functions. Have you considered taking a different approach? On Thu, Feb 26, 2015 at 3:42 AM, dritanbleco dritan.bl...@gmail.com wrote: Hello i am trying to pass as a parameter a org.apache.spark.rdd.RDD

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-27 Thread Christophe Préaud
Yes, spark.yarn.historyServer.address is used to access the spark history server from yarn; it is not needed if you use only the yarn history server. It may be possible to have both history servers running, but I have not tried that yet. Besides, as far as I have understood, yarn and spark

Re: SparkStreaming failing with exception Could not compute split, block input

2015-02-27 Thread Akhil Das
You could be hitting this issue https://issues.apache.org/jira/browse/SPARK-4516 Apart from that little more information about your job would be helpful. Thanks Best Regards On Wed, Feb 25, 2015 at 11:34 AM, Mukesh Jha me.mukesh@gmail.com wrote: Hi Experts, My Spark Job is failing with

Facing error: java.lang.ArrayIndexOutOfBoundsException while executing SparkSQL join query

2015-02-27 Thread anamika gupta
I have three tables with the following schema: case class* date_d*(WID: Int, CALENDAR_DATE: java.sql.Timestamp, DATE_STRING: String, DAY_OF_WEEK: String, DAY_OF_MONTH: Int, DAY_OF_YEAR: Int, END_OF_MONTH_FLAG: String, YEARWEEK: Int, CALENDAR_MONTH: String, MONTH_NUM: Int, YEARMONTH: Int, QUARTER:

Re: Augment more data to existing MatrixFactorization Model?

2015-02-27 Thread Jeffrey Jedele
Hey Anish, machine learning models that are updated with incoming data are commonly known as online learning systems. Matrix factorization is one way to implement recommender systems, but not the only one. There are papers about how to do online matrix factorization, but you will likely have to

Facing error: java.lang.ArrayIndexOutOfBoundsException while executing SparkSQL join query

2015-02-27 Thread anu
I have three tables with the following schema: case class *date_d*(WID: Int, CALENDAR_DATE: java.sql.Timestamp, DATE_STRING: String, DAY_OF_WEEK: String, DAY_OF_MONTH: Int, DAY_OF_YEAR: Int, END_OF_MONTH_FLAG: String, YEARWEEK: Int, CALENDAR_MONTH: String, MONTH_NUM: Int, YEARMONTH: Int, QUARTER:

Re: how to map and filter in one step?

2015-02-27 Thread Jeffrey Jedele
Hi, we are using RDD#mapPartitions() to achieve the same. Are there advantages/disadvantages of using one method over the other? Regards, Jeff 2015-02-26 20:02 GMT+01:00 Mark Hamstra m...@clearstorydata.com: rdd.map(foo).filter(bar) and rdd.filter(bar).map(foo) will each already be pipelined
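For comparison, a quick sketch of the two formulations (foo and bar stand in for the original map function and filter predicate):

    val rdd = sc.parallelize(1 to 100)
    val foo = (x: Int) => x * 2       // stand-in map function
    val bar = (x: Int) => x % 3 == 0  // stand-in filter predicate

    // Chained map and filter: Spark pipelines these inside one stage, one pass over the data.
    val out1 = rdd.map(foo).filter(bar)

    // The same result with mapPartitions, applying one iterator transformation per partition.
    val out2 = rdd.mapPartitions(_.map(foo).filter(bar))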

Re: Speed Benchmark

2015-02-27 Thread Sean Owen
Is machine 1 the only one running an HDFS data node? You describe it as one running Hadoop services. On Feb 27, 2015 9:44 PM, Guillaume Guy guillaume.c@gmail.com wrote: Hi Jason: Thanks for your feedback. Beside the information above I mentioned, there are 3 machines in the cluster.

Re: Running spark function on parquet without sql

2015-02-27 Thread Deborah Siegel
Hi Michael, Would you help me understand the apparent difference here? The Spark 1.2.1 programming guide indicates: Note that if you call schemaRDD.cache() rather than sqlContext.cacheTable(...), tables will *not* be cached using the in-memory columnar format, and therefore
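The two call sites being compared, as a sketch (schemaRDD and the table name are illustrative; the comments only restate the question, not the answer):

    // Cached through the SQL context: associated with the in-memory columnar format.
    sqlContext.cacheTable("events")

    // Cached as a plain RDD: the 1.2.1 guide text quoted above says this path does
    // not use the columnar format, which is the apparent contradiction being asked about.
    schemaRDD.cache()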

Re: Speed Benchmark

2015-02-27 Thread Guillaume Guy
It is a simple text file. I'm not using SQL, just doing an rdd.count() on it. Does the bug affect it? On Friday, February 27, 2015, Davies Liu dav...@databricks.com wrote: What is this dataset? text file or parquet file? There is an issue with serialization in Spark SQL, which will make it

Re: What joda-time dependency does spark submit use/need?

2015-02-27 Thread Todd Nist
You can specify these jars (joda-time-2.7.jar, joda-convert-1.7.jar) either as part of your build and assembly or via the --jars option to spark-submit. HTH. On Fri, Feb 27, 2015 at 2:48 PM, Su She suhsheka...@gmail.com wrote: Hello Everyone, I'm having some issues launching (non-spark)
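A sketch of the --jars form of that suggestion (paths and the main class are illustrative):

    bin/spark-submit --class com.example.DynamoApp \
      --jars /path/to/joda-time-2.7.jar,/path/to/joda-convert-1.7.jar \
      my-app.jar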

Re: Failed to parse Hive query

2015-02-27 Thread Michael Armbrust
Do you have a hive-site.xml file or a core-site.xml file? Perhaps something is misconfigured there? On Fri, Feb 27, 2015 at 7:17 AM, Anusha Shamanur anushas...@gmail.com wrote: Hi, I am trying to do this in spark-shell: val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc) val

Re: Speed Benchmark

2015-02-27 Thread Davies Liu
What is this dataset? text file or parquet file? There is an issue with serialization in Spark SQL, which will make it very slow, see https://issues.apache.org/jira/browse/SPARK-6055, will be fixed very soon. Davies On Fri, Feb 27, 2015 at 1:59 PM, Guillaume Guy guillaume.c@gmail.com wrote:

Re: group by order by fails

2015-02-27 Thread iceback
String query = "select s.name, count(s.name) as tally from sample s group by s.name order by tally"; -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/group-by-order-by-fails-tp21815p21854.html Sent from the Apache Spark User List mailing list archive at

Kafka DStream Parallelism

2015-02-27 Thread Corey Nolet
Looking @ [1], it seems to recommend pulling from multiple Kafka topics in order to parallelize the data received from Kafka over multiple nodes. I notice in [2], however, that one of the createConsumer() functions takes a groupId. So am I understanding correctly that creating multiple DStreams with the

Re: Global sequential access of elements in RDD

2015-02-27 Thread Wush Wu
Thanks for your reply. But your code snippet uses `collect`, which is not feasible for me. My algorithm involves a large amount of data and I do not want to transmit it. Wush 2015-02-27 16:27 GMT+08:00 Yanbo Liang yblia...@gmail.com: Actually, sortBy will return an ordered RDD. Your

Re: spark streaming, batchinterval,windowinterval and window sliding interval difference

2015-02-27 Thread Jeffrey Jedele
If you read the streaming programming guide, you'll notice that Spark does not do real streaming but emulates it with a so-called mini-batching approach. Let's say you want to work with a continuous stream of incoming events from a computing centre: Batch interval: That's the basic heartbeat of
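A sketch tying the three intervals together (source and durations are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Batch interval: the basic heartbeat; one micro-batch RDD is produced every 10 seconds.
    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Window interval (30s) and sliding interval (20s): every 20 seconds, process the
    // last 30 seconds of batches. Both must be multiples of the batch interval.
    val windowed = lines.window(Seconds(30), Seconds(20))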

NoSuchElementException: None.get

2015-02-27 Thread patcharee
Hi, I got a NoSuchElementException when I tried to iterate through a Map which contains some elements (not null, not empty). When I debug my code (below), it seems the first part of the code, which fills the Map, is executed after the second part that iterates the Map. The 1st part and 2nd part

Re: One of the executor not getting StopExecutor message

2015-02-27 Thread Akhil Das
Mostly, that particular executor is stuck in a GC pause; what operation are you performing? You can try increasing the parallelism if you see that only 1 executor is doing the task. Thanks Best Regards On Fri, Feb 27, 2015 at 11:39 AM, twinkle sachdeva twinkle.sachd...@gmail.com wrote: Hi, I am

Re: SparkStreaming failing with exception Could not compute split, block input

2015-02-27 Thread Jeffrey Jedele
I don't have an idea, but perhaps a little more context would be helpful. What is the source of your streaming data? What's the storage level you're using? What are you doing? Some kind of windows operations? Regards, Jeff 2015-02-26 18:59 GMT+01:00 Mukesh Jha me.mukesh@gmail.com: On

Re: Get importerror when i run pyspark with ipython=1

2015-02-27 Thread sourabhguha
http://apache-spark-user-list.1001560.n3.nabble.com/file/n21853/pythonpath.jpg Here's the PYTHONPATH. It points to the correct location. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Get-importerror-when-i-run-pyspark-with-ipython-1-tp21843p21853.html

Re: Running spark function on parquet without sql

2015-02-27 Thread Michael Armbrust
From Zhan Zhang's reply, yes I still get the parquet's advantage. You will need to at least use SQL or the DataFrame API (coming in Spark 1.3) to specify the columns that you want in order to get the parquet benefits. The rest of your operations can be standard Spark. My next question is,
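A sketch of that, selecting only the needed columns so the Parquet scan gets the columnar benefit before handing off to ordinary RDD operations (Spark 1.2 API; path, table and column names are illustrative):

    val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
    events.registerTempTable("events")

    // Only the columns referenced in the query are read from the Parquet files.
    val ids = sqlContext.sql("SELECT id, ts FROM events").map(row => row.getString(0))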

Error when running spark-shell

2015-02-27 Thread AJ614
Has anyone seen this - # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x000110fcd9cd, pid=4709, tid=11011 # # JRE version: Java(TM) SE Runtime Environment (8.0_25-b17) (build 1.8.0_25-b17) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.25-b02

Re: Failed to parse Hive query

2015-02-27 Thread Anusha Shamanur
I do. What tags should I change in this? I changed the value of hive.exec.scratchdir to /tmp/hive. What else? On Fri, Feb 27, 2015 at 2:14 PM, Michael Armbrust mich...@databricks.com wrote: Do you have a hive-site.xml file or a core-site.xml file? Perhaps something is misconfigured there?

Perf impact of BlockManager byte[] copies

2015-02-27 Thread Paul Wais
Dear List, I'm investigating some problems related to native code integration with Spark, and while picking through BlockManager I noticed that data (de)serialization currently issues lots of array copies. Specifically: - Deserialization: BlockManager marshals all deserialized bytes through a

Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-27 Thread shahab
Hi, I just wonder if there is any Sparse Matrix implementation available in Spark, so it can be used in spark application? best, /Shahab

Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-27 Thread Peter Rudenko
Yes, it's called CoordinateMatrix (http://spark.apache.org/docs/latest/mllib-data-types.html#coordinatematrix); you need to fill it with elements of type MatrixEntry((Long, Long, Double)). Thanks, Peter Rudenko On 2015-02-27 14:01, shahab wrote: Hi, I just wonder if there is any Sparse
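A minimal sketch of that suggestion (dimensions and entries are illustrative):

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    // Each non-zero cell becomes a MatrixEntry(row, col, value); zeros are simply absent.
    val entries = sc.parallelize(Seq(
      MatrixEntry(0L, 1L, 2.5),
      MatrixEntry(3L, 0L, 1.0)))
    val sparseMat = new CoordinateMatrix(entries, 4L, 2L)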