Re: Pig on Spark

2014-04-25 Thread suman bharadwaj
Hey Mayur, We use HiveColumnarLoader and XMLLoader. Are these working as well? Will try a few things regarding porting Java MR. Regards, Suman Bharadwaj S On Thu, Apr 24, 2014 at 3:09 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: Right now UDF is not working. It's in the top list though.

MultipleOutputs IdentityReducer

2014-04-25 Thread Andre Kuhnen
Hello, I am trying to write multiple files with Spark, but I cannot find a way to do it. Here is the idea: val rddKeyValue: RDD[(String, String)] = rddlines.map(line => createKeyValue(line)). Now I would like to save this as keyname.txt with all the values inside the file. I tried to use this
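A minimal sketch of one common workaround (not from the thread itself; class and path names are illustrative): subclass Hadoop's old-API MultipleTextOutputFormat so each key becomes its own file, then write with saveAsHadoopFile.

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Route each record to a file named after its key.
    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + ".txt"
      // Suppress the key so each file contains only the values.
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
    }

    rddKeyValue.saveAsHadoopFile("hdfs:///out", classOf[String], classOf[String],
      classOf[KeyBasedOutput])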

RE: JMX with Spark

2014-04-25 Thread Ravi Hemnani
Can you share your working metrics.properties? I want remote JMX to be enabled, so I need to use the JMXSink and monitor my Spark master and workers. But what are the parameters that need to be defined, like host and port? So your config can help. -- View this message in context:
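For reference, a hedged sketch of the standard Spark metrics setup (port and flags are placeholders, not Ravi's actual config): metrics.properties only enables the JMX sink; remote access is controlled by JVM flags.

    # conf/metrics.properties: enable the JMX sink for all instances
    *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

    # Remote JMX is configured via JVM flags rather than metrics.properties,
    # e.g. in spark-env.sh (host/port are placeholders):
    # SPARK_DAEMON_JAVA_OPTS="-Dcom.sun.management.jmxremote
    #   -Dcom.sun.management.jmxremote.port=8090
    #   -Dcom.sun.management.jmxremote.authenticate=false
    #   -Dcom.sun.management.jmxremote.ssl=false"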

read file from hdfs

2014-04-25 Thread Joe L
I have just two questions. sc.textFile("hdfs://host:port/user/matei/whatever.txt") Is host the master node? What port should we use? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/read-file-from-hdfs-tp4824.html Sent from the Apache Spark User List mailing
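A hedged example (host and port come from the HDFS configuration, not from Spark; 9000 is merely a common default):

    // host:port must match fs.default.name in the cluster's core-site.xml,
    // i.e. the HDFS namenode address, not the Spark master.
    val lines = sc.textFile("hdfs://namenode-host:9000/user/matei/whatever.txt")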

Re: Pig on Spark

2014-04-25 Thread Mark Baker
I've only had a quick look at Pig, but it seems that a declarative layer on top of Spark couldn't be anything other than a big win, as it allows developers to declare *what* they want, permitting the compiler to determine how best to poke at the RDD API to implement it. In my brief time with Spark,

FW: reduceByKeyAndWindow - spark internals

2014-04-25 Thread Adrian Mocanu
Any suggestions where I can find this in the documentation or elsewhere? Thanks From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: April-24-14 11:26 AM To: u...@spark.incubator.apache.org Subject: reduceByKeyAndWindow - spark internals If I have this code: val stream1=
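For context, a minimal sketch of the operation being asked about (not Adrian's actual code; it assumes stream1 is a DStream of (String, Int) pairs, and the durations are illustrative):

    import org.apache.spark.streaming.Seconds

    // Sum the values per key over a 30s window, sliding every 10s.
    val windowed = stream1.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,   // associative reduce over values in the window
      Seconds(30),                 // window length
      Seconds(10))                 // slide interval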

Re: Deploying a python code on a spark EC2 cluster

2014-04-25 Thread Shubhabrata
This is the error from stderr: Spark Executor Command: java -cp :/root/ephemeral-hdfs/conf:/root/ephemeral-hdfs/conf:/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar -Djava.library.path=/root/ephemeral-hdfs/lib/native/

Re: Pig on Spark

2014-04-25 Thread Eugen Cepoi
It depends; personally I have the opposite opinion. IMO expressing pipelines in a functional language feels natural, you just have to get used to the language (Scala). Testing Spark jobs is easy, whereas testing a Pig script is much harder and not natural. If you want a higher-level language

Re: Deploying a python code on a spark EC2 cluster

2014-04-25 Thread Shubhabrata
In order to check if there is any issue with the Python API, I ran a Scala application provided in the examples. Still the same error: ./bin/run-example org.apache.spark.examples.SparkPi spark://[Master-URL]:7077 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in

Re: what is the best way to do cartesian

2014-04-25 Thread Alex Boisvert
You might want to try the built-in RDD.cartesian() method. On Thu, Apr 24, 2014 at 9:05 PM, Qin Wei wei@dewmobile.net wrote: Hi All, I have a problem with the Item-Based Collaborative Filtering Recommendation Algorithms in Spark. The basic flow is as below:
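A minimal sketch of the built-in method (toy data, purely illustrative):

    val a = sc.parallelize(Seq(1, 2, 3))
    val b = sc.parallelize(Seq("x", "y"))
    val product = a.cartesian(b)   // RDD[(Int, String)] containing all 6 pairs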

Re: what is the best way to do cartesian

2014-04-25 Thread Eugen Cepoi
Depending on the size of the RDD, you could also do a collect + broadcast and then compute the product in a map function over the other RDD. If this is the same RDD you might also want to cache it. This pattern worked quite well for me. On 25 Apr 2014 18:33, Alex Boisvert alex.boisv...@gmail.com wrote:
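A hedged sketch of the pattern Eugen describes (RDD names are illustrative; it assumes smallRdd fits in driver memory):

    val small = smallRdd.collect()        // pull the small side to the driver
    val smallBc = sc.broadcast(small)     // ship it once per executor
    val product = bigRdd.flatMap { x =>
      smallBc.value.map(y => (x, y))      // local cross product, no shuffle
    }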

Spark Shark 0.9.1 on ec2 with Hadoop 2 error

2014-04-25 Thread jesseerdmann
I've run into a problem trying to launch a cluster using the provided EC2 Python script with --hadoop-major-version 2. The launch completes correctly, except for an exception thrown for Tachyon 7 (I've included it at the end of the message, but that is not the focus and seems

Re: Securing Spark's Network

2014-04-25 Thread Akhil Das
Hi Jacob, This post might give you a brief idea about the ports being used https://groups.google.com/forum/#!topic/spark-users/PN0WoJiB0TA On Fri, Apr 25, 2014 at 8:53 PM, Jacob Eisinger jeis...@us.ibm.com wrote: Howdy, We tried running Spark 0.9.1 stand-alone inside docker containers

Strange lookup behavior. Possible bug?

2014-04-25 Thread Yadid Ayzenberg
Hi All, I'm running a lookup on a JavaPairRDD<String, Tuple2>. When running on a local machine, the lookup is successful. However, when running on a standalone cluster with the exact same dataset, one of the tasks never ends (constantly in RUNNING status). When viewing the worker log, it seems that

help

2014-04-25 Thread Joe L
I need someone's help, please. I am getting the following error: [error] 14/04/26 03:09:47 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140426030946-0004/8 removed: class java.io.IOException: Cannot run program /home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh (in directory

Re: Pig on Spark

2014-04-25 Thread Bharath Mundlapudi
I've only had a quick look at Pig, but it seems that a declarative layer on top of Spark couldn't be anything other than a big win, as it allows developers to declare *what* they want, permitting the compiler to determine how best to poke at the RDD API to implement it. The devil is in the

Re: Spark and HBase

2014-04-25 Thread Josh Mahonin
Phoenix generally presents itself as an endpoint using JDBC, which in my testing seems to play nicely with JdbcRDD. However, a few days ago a patch was made against Phoenix to implement support via Pig using a custom Hadoop InputFormat, which means it now has Spark support too. Here's a code
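A hedged sketch of the JdbcRDD route (table, columns, bounds, and the ZooKeeper host are invented for illustration):

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val rows = new JdbcRDD(sc,
      () => DriverManager.getConnection("jdbc:phoenix:zk-host"),
      "SELECT id, val FROM my_table WHERE id >= ? AND id <= ?",
      1L, 1000L, 3,                          // bounds for the two '?' placeholders, 3 partitions
      rs => (rs.getLong(1), rs.getString(2)))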

Re: help

2014-04-25 Thread Joe L
Hi, thank you for your reply, but I could not find it. It says no such file or directory. http://apache-spark-user-list.1001560.n3.nabble.com/file/n4848/Capture.png -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/help-tp4841p4848.html Sent from the

Build times for Spark

2014-04-25 Thread Williams, Ken
I've cloned the GitHub repo and I'm building Spark on a pretty beefy machine (24 CPUs, 78GB of RAM), and it takes a pretty long time. For instance, today I did a 'git pull' for the first time in a week or two, and then doing 'sbt/sbt assembly' took 43 minutes of wallclock time (88 minutes of

Scala Spark / Shark: How to access existing Hive tables in Hortonworks?

2014-04-25 Thread Darq Moth
I am trying to find some docs / description of the approach on the subject, please help. I have Hadoop 2.2.0 from Hortonworks installed with some existing Hive tables I need to query. Hive SQL runs extremely and unreasonably slowly on a single node and on the cluster as well. I hope Shark will work faster.

Re: Scala Spark / Shark: How to access existing Hive tables in Hortonworks?

2014-04-25 Thread Mayur Rustagi
You have to configure Shark to access the Hortonworks Hive metastore (HCatalog?). You will start seeing the tables in the Shark shell and can run queries like normal; Shark will leverage Spark for processing your queries. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi
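A hedged sketch of the usual wiring (paths are illustrative; the key step is exposing the cluster's hive-site.xml to Shark via shark-env.sh):

    # shark-env.sh: point Shark at the existing Hive metastore
    export HIVE_CONF_DIR=/etc/hive/conf    # directory containing hive-site.xml
    export HADOOP_HOME=/usr/lib/hadoop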

Re: Build times for Spark

2014-04-25 Thread Akhil Das
You can always increase the sbt memory by setting export JAVA_OPTS="-Xmx10g". Thanks Best Regards On Sat, Apr 26, 2014 at 2:17 AM, Williams, Ken ken.willi...@windlogics.com wrote: No, I haven't done any config for SBT. Is there somewhere you might be able to point me toward for how to do

Re: Securing Spark's Network

2014-04-25 Thread Jacob Eisinger
Howdy Akhil, Thanks - that did help! And, it made me think about how the EC2 scripts work [1] to set up security. From my understanding of EC2 security groups [2], this just sets up external access, right? (This has no effect on internal communication between the instances, right?) I am

Re: Build times for Spark

2014-04-25 Thread Shivaram Venkataraman
Are you by any chance building this on NFS? As far as I know, the build is severely bottlenecked by filesystem calls during assembly (each class file in each dependency gets an fstat call or something like that). That is partly why building from, say, a local ext4 filesystem or an SSD is much faster

Re: Build times for Spark

2014-04-25 Thread Shivaram Venkataraman
AFAIK the resolver does pick up things from your local ~/.m2 -- note that as ~/.m2 is on NFS, that adds to the amount of filesystem traffic. Shivaram On Fri, Apr 25, 2014 at 2:57 PM, Williams, Ken ken.willi...@windlogics.com wrote: I am indeed, but it's a pretty fast NFS. I don't have any SSD

Re: Spark and HBase

2014-04-25 Thread Nicholas Chammas
Josh, is there a specific use pattern you think is served well by Phoenix + Spark? Just curious. On Fri, Apr 25, 2014 at 3:17 PM, Josh Mahonin jmaho...@filetrek.com wrote: Phoenix generally presents itself as an endpoint using JDBC, which in my testing seems to play nicely using JdbcRDD.

Re: Strange lookup behavior. Possible bug?

2014-04-25 Thread Yadid Ayzenberg
Some additional information - maybe this rings a bell with someone: I suspect this happens when the lookup returns more than one value. For 0 and 1 values, the function behaves as you would expect. Anyone? On 4/25/14, 1:55 PM, Yadid Ayzenberg wrote: Hi All, I'm running a lookup on a

Re: help

2014-04-25 Thread Jey Kottalam
Sorry, but I don't know where Cloudera puts the executor log files. Maybe their docs give the correct path? On Fri, Apr 25, 2014 at 12:32 PM, Joe L selme...@yahoo.com wrote: hi thank you for your reply but I could not find it. it says that no such file or directory

Running out of memory Naive Bayes

2014-04-25 Thread John King
I've been trying to use the Naive Bayes classifier. Each example in the dataset has about 2 million features, only about 20-50 of which are non-zero, so the vectors are very sparse. I keep running out of memory though, even for about 1000 examples on 30GB RAM, while the entire dataset is 4 million
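A hedged sketch of the MLlib call in question (assumes an MLlib version with sparse vector support, and a hypothetical input RDD `data` of (label, indices, values) triples):

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Build each 2M-dimensional example sparsely from its ~20-50 non-zeros.
    val examples = data.map { case (label, indices, values) =>
      LabeledPoint(label, Vectors.sparse(2000000, indices, values))
    }
    val model = NaiveBayes.train(examples, lambda = 1.0)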

Re: parallelize for a large Seq is extreamly slow.

2014-04-25 Thread Earthson
I've tried to set a larger buffer, but reduceByKey seems to have failed. Need help :) 14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Shutting down all executors 14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Asking each executor to shut down 14/04/26 12:31:12 INFO