Hey Mayur,
We use HiveColumnarLoader and XMLLoader. Are these working as well?
Will try a few things regarding porting Java MR.
Regards,
Suman Bharadwaj S
On Thu, Apr 24, 2014 at 3:09 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:
Right now UDF is not working. It's at the top of the list, though.
Hello,
I am trying to write multiple files with Spark, but I cannot find a way to
do it.
Here is the idea.
val rddKeyValue: RDD[(String, String)] = rddlines.map(line =>
createKeyValue(line))
Now I would like to save each key as keyname.txt, with all of that key's
values inside the file.
I tried to use this
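(The message is cut off above. For illustration, one common approach is a custom MultipleTextOutputFormat passed to saveAsHadoopFile; a minimal sketch, with a placeholder output path:)

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.SparkContext._

// Route each (key, value) pair into a file named after its key.
class KeyAsFileNameFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + ".txt"
  // Write only the value into the file body; drop the key.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

rddKeyValue.saveAsHadoopFile("hdfs://host:port/outputdir",
  classOf[String], classOf[String], classOf[KeyAsFileNameFormat])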
Can you share your working metrics.properties?
I want remote JMX to be enabled, so I need to use the JMXSink and monitor my
Spark master and workers.
But what are the parameters that need to be defined, like host and port?
So your config can help.
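(For reference, a minimal sketch: the sink class name is from the Spark metrics docs, while the remote-JMX host and port are plain JVM options rather than metrics.properties settings; the port below is an arbitrary example.)

# conf/metrics.properties: enable the JMX sink for all instances
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

# JVM options for the master/worker daemons, e.g. in spark-env.sh
SPARK_DAEMON_JAVA_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8090 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"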
I have just two questions.
sc.textFile("hdfs://host:port/user/matei/whatever.txt")
Is host the master node?
What port should we use?
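(A hedged note: the host and port in an hdfs:// URL are those of the HDFS NameNode, i.e. the value of fs.default.name in core-site.xml (commonly port 8020 or 9000), not the Spark master. For example, with a made-up hostname:)

// host:port must match fs.default.name in core-site.xml;
// "namenode-host:9000" is only an example.
val lines = sc.textFile("hdfs://namenode-host:9000/user/matei/whatever.txt")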
I've only had a quick look at Pig, but it seems that a declarative
layer on top of Spark couldn't be anything other than a big win, as it
allows developers to declare *what* they want, permitting the compiler
to determine how best to poke at the RDD API to implement it.
In my brief time with Spark,
Any suggestions where I can find this in the documentation or elsewhere?
Thanks
From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: April-24-14 11:26 AM
To: u...@spark.incubator.apache.org
Subject: reduceByKeyAndWindow - spark internals
If I have this code:
val stream1=
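(The snippet is cut off above. For illustration, a minimal hedged sketch of reduceByKeyAndWindow on a made-up pair DStream, not the poster's actual code:)

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// Hypothetical input: a socket stream turned into (word, 1) pairs.
val ssc = new StreamingContext(sc, Seconds(1))
val stream1 = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// Sum counts per key over a 30-second window that slides every 10 seconds.
val windowed = stream1.reduceByKeyAndWindow((a: Int, b: Int) => a + b,
  Seconds(30), Seconds(10))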
This is the error from stderr:
Spark Executor Command: java -cp
:/root/ephemeral-hdfs/conf:/root/ephemeral-hdfs/conf:/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar
-Djava.library.path=/root/ephemeral-hdfs/lib/native/
It depends; personally, I have the opposite opinion.
IMO expressing pipelines in a functional language feels natural; you just
have to get used to the language (Scala).
Testing Spark jobs is easy, whereas testing a Pig script is much harder and
less natural.
If you want a more high level language
In order to check if there is any issue with the Python API, I ran a Scala
application provided in the examples. Still the same error.
./bin/run-example org.apache.spark.examples.SparkPi
spark://[Master-URL]:7077
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
You might want to try the built-in RDD.cartesian() method.
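(A minimal hedged sketch of what that looks like, with made-up item vectors; the actual data layout isn't shown in the thread:)

val items = sc.parallelize(Seq(
  ("item1", Array(1.0, 0.0, 2.0)),
  ("item2", Array(0.0, 1.0, 1.0)),
  ("item3", Array(2.0, 1.0, 0.0))))

// All pairs of items; keep each unordered pair once and skip self-pairs.
val sims = items.cartesian(items)
  .filter { case ((id1, _), (id2, _)) => id1 < id2 }
  .map { case ((id1, v1), (id2, v2)) =>
    ((id1, id2), v1.zip(v2).map { case (a, b) => a * b }.sum)  // dot product
  }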
On Thu, Apr 24, 2014 at 9:05 PM, Qin Wei wei@dewmobile.net wrote:
Hi All,
I have a problem with the Item-Based Collaborative Filtering Recommendation
Algorithms in Spark.
The basic flow is as below:
Depending on the size of the RDD, you could also do a collect + broadcast and
then compute the product in a map function over the other RDD. If this is
the same RDD, you might also want to cache it. This pattern worked quite
well for me.
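(A hedged sketch of that pattern; smallRdd and largeRdd are hypothetical names:)

// Collect the smaller RDD and broadcast it; assumes it fits in memory.
val small = sc.broadcast(smallRdd.collect().toMap)

// One pass over the large RDD, with no shuffle (unlike cartesian()).
val products = largeRdd.flatMap { case (id, vec) =>
  small.value.map { case (otherId, otherVec) =>
    ((id, otherId), vec.zip(otherVec).map { case (a, b) => a * b }.sum)
  }
}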
On 25 Apr 2014 at 18:33, Alex Boisvert alex.boisv...@gmail.com wrote:
I've run into a problem trying to launch a cluster using the provided EC2
Python script with --hadoop-major-version 2. The launch completes correctly,
except that an exception gets thrown for Tachyon 7 (I've
included it at the end of the message, but that is not the focus and seems
Hi Jacob,
This post might give you a brief idea about the ports being used
https://groups.google.com/forum/#!topic/spark-users/PN0WoJiB0TA
On Fri, Apr 25, 2014 at 8:53 PM, Jacob Eisinger jeis...@us.ibm.com wrote:
Howdy,
We tried running Spark 0.9.1 stand-alone inside Docker containers
Hi All,
I'm running a lookup on a JavaPairRDD&lt;String, Tuple2&gt;.
When running on a local machine, the lookup is successful. However, when
running on a standalone cluster with the exact same dataset, one of the
tasks never ends (constantly in RUNNING status).
When viewing the worker log, it seems that
I need someone's help, please. I am getting the following error:
[error] 14/04/26 03:09:47 INFO cluster.SparkDeploySchedulerBackend: Executor
app-20140426030946-0004/8 removed: class java.io.IOException: Cannot run
program /home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh (in
directory
The devil is in the details, though.
Phoenix generally presents itself as an endpoint using JDBC, which in my
testing seems to play nicely using JdbcRDD.
However, a few days ago a patch was made against Phoenix to implement
support via Pig using a custom Hadoop InputFormat, which means now it has
Spark support too.
Here's a code
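(The snippet is cut off above. For illustration only, a minimal JdbcRDD sketch against a Phoenix JDBC endpoint; the connection string, table, and bounds are made up:)

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Register the Phoenix JDBC driver, then read a bounded id range in 4 partitions.
Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
val rows = new JdbcRDD(sc,
  () => DriverManager.getConnection("jdbc:phoenix:zookeeper-host"),
  "SELECT id, val FROM my_table WHERE id >= ? AND id <= ?",
  1L, 1000L, 4,
  (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))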
Hi, thank you for your reply, but I could not find it. It says no such
file or directory:
http://apache-spark-user-list.1001560.n3.nabble.com/file/n4848/Capture.png
I've cloned the GitHub repo and I'm building Spark on a pretty beefy machine
(24 CPUs, 78GB of RAM) and it takes a pretty long time.
For instance, today I did a 'git pull' for the first time in a week or two, and
then doing 'sbt/sbt assembly' took 43 minutes of wallclock time (88 minutes of
I am trying to find some docs / description of the approach on the subject,
please help. I have Hadoop 2.2.0 from Hortonworks installed with some
existing Hive tables I need to query. Hive SQL runs extremely and
unreasonably slowly on a single node and on a cluster as well. I hope Shark
will work faster.
You have to configure Shark to access the Hortonworks Hive metastore
(HCatalog?). You will then start seeing the tables in the Shark shell and can
run queries as normal; Shark will leverage Spark to process your queries.
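(A hedged sketch of the relevant setting, assuming a remote metastore service; the hostname is a placeholder and 9083 is the usual metastore Thrift port:)

<!-- hive-site.xml visible to Shark -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>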
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
You can always increase the sbt memory by setting
export JAVA_OPTS=-Xmx10g
Thanks
Best Regards
On Sat, Apr 26, 2014 at 2:17 AM, Williams, Ken
ken.willi...@windlogics.com wrote:
No, I haven't done any config for SBT. Is there somewhere you might be
able to point me toward for how to do
Howdy Akhil,
Thanks - that did help! And, it made me think about how the EC2 scripts
work [1] to set up security. From my understanding of EC2 security groups
[2], this just sets up external access, right? (This has no effect on
internal communication between the instances, right?)
I am
Are you by any chance building this on NFS? As far as I know, the build is
severely bottlenecked by filesystem calls during assembly (each class file
in each dependency gets an fstat call or something like that). That is
partly why building from, say, a local ext4 filesystem or an SSD is much
faster.
AFAIK the resolver does pick up things from your local ~/.m2 -- note that,
as ~/.m2 is on NFS, that adds to the amount of filesystem traffic.
Shivaram
On Fri, Apr 25, 2014 at 2:57 PM, Williams, Ken
ken.willi...@windlogics.com wrote:
I am indeed, but it's a pretty fast NFS. I don't have any SSD
Josh, is there a specific use pattern you think is served well by Phoenix +
Spark? Just curious.
On Fri, Apr 25, 2014 at 3:17 PM, Josh Mahonin jmaho...@filetrek.com wrote:
Phoenix generally presents itself as an endpoint using JDBC, which in my
testing seems to play nicely using JdbcRDD.
Some additional information - maybe this rings a bell with someone:
I suspect this happens when the lookup returns more than one value.
For 0 and 1 values, the function behaves as you would expect.
Anyone?
On 4/25/14, 1:55 PM, Yadid Ayzenberg wrote:
Hi All,
I'm running a lookup on a
Sorry, but I don't know where Cloudera puts the executor log files.
Maybe their docs give the correct path?
On Fri, Apr 25, 2014 at 12:32 PM, Joe L selme...@yahoo.com wrote:
Hi, thank you for your reply, but I could not find it. It says no such
file or directory
I've been trying to use the Naive Bayes classifier. Each example in the
dataset has about 2 million features, only about 20-50 of which are
non-zero, so the vectors are very sparse. I keep running out of memory,
though, even for about 1000 examples on 30GB of RAM, while the entire dataset
is 4 million
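(A rough back-of-the-envelope, assuming the classifier stores the vectors densely rather than sparsely: 2 million features x 8 bytes per double is about 16 MB per example, so about 1000 examples is already around 16 GB, which would explain the out-of-memory errors on 30 GB of RAM.)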
I've tried to set a larger buffer, but reduceByKey seems to have failed. Need
help :)
14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Shutting down
all executors
14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Asking each
executor to shut down
14/04/26 12:31:12 INFO