We tried changing the compression codec from snappy to lz4. It did improve the
performance but we are still wondering why default options didn’t work as
claimed.
From: Raghavendra Pandey
raghavendra.pan...@gmail.com
Date: Friday, 6 February 2015 1:23 pm
To:
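For reference, the codec change discussed above is a single configuration setting; a minimal sketch (recent Spark versions accept the short name "lz4", while older ones may need the full codec class name):
```
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.io.compression.codec", "lz4")
```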
Hi
I would like to know how I can generate a random matrix where each element
comes from a uniform distribution in [-1, 1].
In particular, I would like the matrix to be a distributed row matrix with
dimensions n x p.
Is this possible with MLlib? Should I use another library?
Forgot to add the more recent training material:
https://databricks-training.s3.amazonaws.com/index.html
On Fri, Feb 6, 2015 at 12:12 PM, Burak Yavuz brk...@gmail.com wrote:
Hi Luca,
You can tackle this using RowMatrix (spark-shell example):
```
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random._
// sc is the spark context, numPartitions is the number of partitions you
// want the RDD to be in
val data: RDD[Vector] =
```
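A minimal sketch of the uniform [-1, 1] case asked about above, assuming MLlib's RandomRDDs API (uniformVectorRDD samples from [0, 1), so each value is rescaled):
```
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.rdd.RDD

// n rows, p columns, uniform in [0, 1), rescaled element-wise to [-1, 1)
val uniform: RDD[Vector] = RandomRDDs.uniformVectorRDD(sc, n, p, numPartitions)
val data: RDD[Vector] = uniform.map(v => Vectors.dense(v.toArray.map(x => 2 * x - 1)))
val mat = new RowMatrix(data) // distributed row matrix of dimension n x p
```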
Thank you Emre, this helps; I am able to get the filename.
But I am not sure how to fit this into the DStream RDD.
val inputStream = ssc.textFileStream("/hdfs/path/")
inputStream is a DStream RDD, and in foreachRDD I am doing my processing:
inputStream.foreachRDD(rdd => {
// how to get the filename here??
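One workaround sometimes suggested is to use fileStream instead of textFileStream, so the underlying Hadoop RDD type is visible and the file behind each partition can be recovered from its input split. A sketch, assuming each batch picks up a single file (with several new files per batch the RDD is a union of per-file RDDs and the cast below will not work as-is):
```
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.InputSplit
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.NewHadoopRDD

val stream = ssc.fileStream[LongWritable, Text, TextInputFormat]("/hdfs/path/")
stream.foreachRDD { rdd =>
  val withNames = rdd.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
    .mapPartitionsWithInputSplit((split: InputSplit, it: Iterator[(LongWritable, Text)]) => {
      val file = split.asInstanceOf[FileSplit].getPath.toString
      it.map { case (_, line) => (file, line.toString) } // tag each line with its file
    }, preservesPartitioning = true)
  withNames.take(5).foreach(println)
}
```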
OK I tried that, but how do I convert an RDD to a Set that I can then
broadcast and cache?
val badIPs = sc.textFile("hdfs:///user/jon/" + "badfullIPs.csv")
val badIPsLines = badIPs.getLines
val badIpSet = badIPsLines.toSet
val badIPsBC = sc.broadcast(badIpSet)
produces the
Hi,
I have a dataset in CSV format and I am trying to standardize the features
before using k-means clustering. The data does not have any labels but has
the following format:
s1, f11, f12, ...
s2, f21, f22, ...
where s is a string id, and f is a floating-point feature value.
To perform feature
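A minimal sketch of that preprocessing with MLlib's StandardScaler (the file name, k, and iteration count are placeholders):
```
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

// parse "id, f1, f2, ..." rows into (id, feature vector) pairs
val rows = sc.textFile("features.csv").map { line =>
  val parts = line.split(",").map(_.trim)
  (parts.head, Vectors.dense(parts.tail.map(_.toDouble)))
}
// fit mean/stddev on the features, then standardize each row
val scaler = new StandardScaler(withMean = true, withStd = true).fit(rows.values)
val scaled = rows.mapValues(v => scaler.transform(v))
val model = KMeans.train(scaled.values, 4, 20) // k = 4, 20 iterations: placeholders
```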
YARN log aggregation is enabled, and the logs which I get through
yarn logs -applicationId your_application_id
are no different from what I get through the logs in the YARN application
tracking URL. They still don't have the above logs.
On Fri, Feb 6, 2015 at 3:36 PM, Petar Zecevic
You can call collect() to pull in the contents of an RDD into the driver:
val badIPsLines = badIPs.collect()
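The collected array can then be turned into a Set and broadcast (a sketch; the events RDD and its ip field are hypothetical):
```
val badIpSet = badIPs.collect().toSet   // materialize on the driver as a Set
val badIPsBC = sc.broadcast(badIpSet)   // ship it once to each executor
val clean = events.filter(e => !badIPsBC.value.contains(e.ip))
```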
On Fri, Feb 6, 2015 at 12:19 PM, Jon Gregg jonrgr...@gmail.com wrote:
OK I tried that, but how do I convert an RDD to a Set that I can then
broadcast and cache?
val badIPs =
To add to what Petar said, when YARN log aggregation is enabled, consider
specifying yarn.nodemanager.remote-app-log-dir which is where aggregated
logs are saved.
Cheers
On Fri, Feb 6, 2015 at 12:36 PM, Petar Zecevic petar.zece...@gmail.com
wrote:
You can enable YARN log aggregation
Thanks for the information. I have no issue connecting to my local Cassandra
server; however, I still have an issue connecting to my company dev server. What
needs to be done to resolve this issue? Thanks so much.
-Vincent
From: Ankur Srivastava [mailto:ankur.srivast...@gmail.com]
Sent: Thursday,
Did the problem go away when you switched to lz4? There was a change in
the default compression codec from 1.0 to 1.1, where we went from LZF to
Snappy. I don't think there was any such change from 1.1 to 1.2, though.
On Fri, Feb 6, 2015 at 12:17 AM, Praveen Garg praveen.g...@guavus.com
wrote:
I am trying to debug my mapPartitions function. Here is the code. There are
two ways I am trying to log: using log.info() or println(). I am running in
yarn-cluster mode. While I can see the logs from the driver code, I am not
able to see logs from the map and mapPartitions functions in the Application
Tracking
Hi,
You can use the Spark Kernel project (https://github.com/ibm-et/spark-kernel)
as a workaround of sorts. The Spark Kernel provides a generic solution to
dynamically interact with an Apache Spark cluster (think of a remote Spark
Shell). It serves as the driver application with which you can
You can enable YARN log aggregation (set yarn.log-aggregation-enable to
true) and execute the command
yarn logs -applicationId your_application_id
after your application finishes.
Or you can look at them directly in HDFS in
/tmp/logs/user/logs/applicationid/hostname
On 6.2.2015. 19:50, nitinkak001
You don't need HDFS or virtual machines to run Spark. You can just download it,
unzip it and run it on your laptop. See
http://spark.apache.org/docs/latest/index.html.
Matei
On Feb 6, 2015, at 2:58 PM, David Fallside falls...@us.ibm.com wrote:
Hi,
I am trying to issue a SQL query against a Parquet file and am getting
errors, and would like some help figuring out what is going on.
The SQL:
select timestamp, count(rid), qi.clientname from records where timestamp >
0 group by qi.clientname
I am getting the following error:
You can't use columns (timestamp) that aren't in the GROUP BY clause.
Spark 1.2+ gives you a better error message for this case.
On Fri, Feb 6, 2015 at 3:12 PM, Mohnish Kodnani mohnish.kodn...@gmail.com
wrote:
Hi,
I am trying to issue a SQL query against a Parquet file and am getting
errors
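In other words, every non-aggregated column in the SELECT must appear in the GROUP BY; a sketch of one possible fix for the query above (whether to group by timestamp or aggregate it depends on the intent):
```
val fixed = sqlContext.sql("""
  SELECT timestamp, count(rid), qi.clientname
  FROM records
  WHERE timestamp > 0
  GROUP BY timestamp, qi.clientname""")
```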
Hi,
I am trying to restrict my serialized classes, as I am having weird issues
with regard to serialization.
However, my efforts hit a brick wall when I got the exception:
Caused by: java.lang.IllegalArgumentException: Class is not registered:
scala.reflect.ClassTag$$anon$1
Note: To register
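For context, a minimal sketch of requiring registration and registering the anonymous ClassTag class by name; Class.forName is the usual way to register classes you cannot name statically, and MyRecord is a hypothetical application class:
```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true") // fail fast on unregistered classes
  .registerKryoClasses(Array(
    classOf[MyRecord],                              // hypothetical application class
    Class.forName("scala.reflect.ClassTag$$anon$1") // the anonymous class from the exception
  ))
```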
yarn.nodemanager.remote-app-log-dir is set to /tmp/logs
On Fri, Feb 6, 2015 at 4:14 PM, Ted Yu yuzhih...@gmail.com wrote:
To add to what Petar said, when YARN log aggregation is enabled, consider
specifying yarn.nodemanager.remote-app-log-dir which is where aggregated
logs are saved.
King, consider trying the Spark Kernel
(https://github.com/ibm-et/spark-kernel), which will install Spark etc. and
provide you with a Spark/Scala Notebook in which you can develop your
algorithm. The Vagrant installation described in
Doh :) Thanks.. seems like brain freeze.
On Fri, Feb 6, 2015 at 3:22 PM, Michael Armbrust mich...@databricks.com
wrote:
You can't use columns (timestamp) that aren't in the GROUP BY clause.
Spark 1.2+ gives you a better error message for this case.
On Fri, Feb 6, 2015 at 3:12 PM, Mohnish
Spark 1.2
Data stored in a Parquet table (large number of rows)
Test 1:
select a, sum(b), sum(c) from table
Test 2:
sqlContext.cacheTable("table")
select a, sum(b), sum(c) from table - seeds the cache; first time slow since
it loads the cache?
select a, sum(b), sum(c) from table - second time it should be faster
Can you try creating just a single Spark context and then try your code?
If you want to use it for streaming, pass the same SparkContext object
instead of the conf.
Note: instead of just replying to me, try to use reply-all so that the
post is visible to the community. That way you can expect
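A minimal sketch of that suggestion (app name and batch interval are placeholders):
```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext(new SparkConf().setAppName("my-app"))
// reuse the one SparkContext rather than building a second context from the conf
val ssc = new StreamingContext(sc, Seconds(10))
```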
Hi folks,
I am new to Spark. I just got Spark 1.2 to run on EMR AMI 3.3.1 (Hadoop 2.4).
I ssh to the EMR master node and submit the job or start the shell. Everything
runs well except the web UI.
In order to see the UI, I used an ssh tunnel which forwards my dev machine port
to the EMR master node web UI
Yes. It improved the performance, not only with Spark 1.2 but with Spark 1.1
also. Precisely, the job took more time to run in Spark 1.2 with default
options, but completed in almost the same time as Spark 1.1 with "lz4" when
run with "lz4".
From: Aaron Davidson
Hi,
I've got the following code http://pastebin.com/3kexKwg6 that's almost
complete, but I have 2 questions:
1) Once I've computed the TF-IDF vector, how do I compute the vector for
each string to feed into the LabeledPoint?
2) Does MLlib provide any methods to evaluate the model's precision,
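On (1), a minimal sketch with MLlib's HashingTF and IDF, assuming docs is an RDD[(Double, Seq[String])] of (label, tokenized text); on (2), org.apache.spark.mllib.evaluation (e.g. BinaryClassificationMetrics) covers precision-style measures:
```
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

val tf = new HashingTF()
val tfVectors = docs.map { case (_, tokens) => tf.transform(tokens) }
tfVectors.cache()                    // IDF makes more than one pass
val idf = new IDF().fit(tfVectors)
val tfidf = idf.transform(tfVectors) // TF-IDF weighted vectors
val points = docs.map(_._1).zip(tfidf).map { case (label, v) => LabeledPoint(label, v) }
```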
Check the storage tab. Does the table actually fit in memory? Otherwise
you are rebuilding column buffers in addition to reading the data off of
the disk.
On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Spark 1.2
Data stored in parquet table (large number of rows)
Hi,
You can do the following:
```
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random._
import org.apache.spark.rdd.RDD
// sc is the spark context, numPartitions is the number of partitions you
// want the RDD to be in
val dist: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, n, k, numPartitions)
val mat = new RowMatrix(dist) // n x k distributed row matrix
```
Hi Judy,
For the driver it is /metrics/json; there's no MetricsServlet for executors.
Thanks
Jerry
From: Judy Nash [mailto:judyn...@exchange.microsoft.com]
Sent: Friday, February 6, 2015 3:47 PM
To: user@spark.apache.org
Subject: Spark Metrics Servlet for driver and executor
Hi all,
Looking at
Hi,
When we submit a PR on GitHub, there are various tests that are performed,
like the RAT test and the Scala style test, and beyond these many other tests
which run for longer.
Could anyone please direct me to the details of the tests that are
performed there?
Thank You
Hi all,
I'm experiencing strange behaviour with Spark 1.2.
I have a 3-node cluster plus the master.
Each node has:
1 HDD, 7200 rpm, 1 TB
16 GB RAM
8 cores
I configured executors with 6 cores and 10 GB each
(spark.storage.memoryFraction = 0.6).
My job is pretty simple:
val file1 =
Hi,
Is there a way to use a custom Python module that is available to all executors
under PYTHONPATH (without needing to upload it using sc.addPyFile())? It's a bit
weird that this module is on all nodes yet the Spark tasks can't use it
(references to its objects are serialized and sent to all executors
That's definitely surprising to me that you would be hitting a lot of GC
for this scenario. Are you setting --executor-cores and
--executor-memory? What are you setting them to?
-Sandy
On Thu, Feb 5, 2015 at 10:17 AM, Guillermo Ortiz konstt2...@gmail.com
wrote:
Any idea why if I use more
Hi,
I have Hive in development and I want to use it in Spark. The Spark SQL
documentation says the following:
Users who do not have an existing Hive deployment can still create a
HiveContext. When not configured by the hive-site.xml, the context
automatically creates metastore_db and warehouse in the
Have a look at the dev/run-tests script.
On Fri, Feb 6, 2015 at 2:58 AM, Deep Pradhan pradhandeep1...@gmail.com wrote:
Hi,
When we submit a PR in Github, there are various tests that are performed
like RAT test, Scala Style Test, and beyond this many other tests which run
for more time.
This is an execution with 80 executors:

Metric     Min       25th percentile   Median    75th percentile   Max
Duration   31s       44s               50s       1.1 min           2.6 min
GC Time    70ms      0.1s              0.3s      4s                53 s
Input      128.0 MB  128.0 MB          128.0 MB  128.0 MB          128.0 MB

I executed as well with 40 executors:

Metric     Min       25th percentile   Median    75th percentile   Max
Duration
Hi All,
sc.textFile will not work because the file is not distributed to the other
workers, so I tried to read the file first using FileUtils.readLines and then
use sc.parallelize, but readLines failed with an OOM (the file is large).
Is there a way to split local files and upload those partitions to
Hi All,
I already found a solution to this problem. Please ignore my question...
Thanx
Best regards,
Henry
From: MA33 YTHung1
Sent: Friday, February 6, 2015 4:34 PM
To: user@spark.apache.org
Subject: how to process a file in spark standalone cluster without distributed
storage (i.e.
I want to write a Spark Streaming consumer for Kafka in Java. I want to
process the data in real time as well as store the data in HDFS in
year/month/day/hour/ format. I am not sure how to achieve this. Should I
write separate Kafka consumers, one for writing data to HDFS and one for
Spark
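One stream can serve both needs; a sketch in Scala (the same structure applies in Java), where the ZooKeeper address, topic, and output paths are placeholders:
```
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))
val kafkaStream = KafkaUtils.createStream(ssc, "zkhost:2181", "mygroup", Map("mytopic" -> 1))
kafkaStream.foreachRDD { (rdd, time) =>
  // persist the raw batch under year/month/day/hour, derived from the batch time
  val dir = new SimpleDateFormat("yyyy/MM/dd/HH").format(new Date(time.milliseconds))
  rdd.map(_._2).saveAsTextFile(s"hdfs:///data/raw/$dir/batch-${time.milliseconds}")
  // ...real-time processing on the same rdd goes here...
}
ssc.start()
ssc.awaitTermination()
```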
Yes, it's surprising to me as well.
I tried to execute it with different configurations,
sudo -u hdfs spark-submit --master yarn-client --class
com.mycompany.app.App --num-executors 40 --executor-memory 4g
Example-1.0-SNAPSHOT.jar hdfs://ip:8020/tmp/sparkTest/ file22.bin
parameters
This is
Hi!
I'm new to Spark. I have a case study where the data is stored in CSV
files. These files have headers with more than 1000 columns. I would like
to know the best practices for parsing them, and in particular the
following points:
1. Getting and parsing all the files from a folder
2.
Yes, having many more cores than disks and all writing at the same time can
definitely cause performance issues. Though that wouldn't explain the high
GC. What percent of task time does the web UI report that tasks are
spending in GC?
On Fri, Feb 6, 2015 at 12:56 AM, Guillermo Ortiz
Thanks for the reply. It seems it is all set to zero in the latest code - I
was checking 1.2 last night.
On Fri Feb 06 2015 at 07:21:35 Sean Owen so...@cloudera.com wrote:
It looks like the initial intercept term is 1 only in the addIntercept &&
numOfLinearPredictor == 1 case. It does seem
Yes they are.
On Fri, Feb 6, 2015 at 5:06 PM, Mohit Durgapal durgapalmo...@gmail.com
wrote:
Just wanted to know If my emails are reaching the user list.
Regards
Mohit
--
Sigmoid Analytics - www.sigmoidanalytics.com
Arush Kharbanda || Technical Teamlead
Hi,
Is the implementation of All-Pairs Shortest Paths in GraphX for directed
graphs or undirected graphs? When I use the algorithm with a dataset, it
assumes that the graph is undirected.
Has anyone come across this earlier?
Thank you
Mohit,
I want to process the data in real time as well as store the data in HDFS
in year/month/day/hour/ format.
Are you wanting to process it and then put it into HDFS, or just put the raw
data into HDFS? If the latter, then why not just use Camus (
https://github.com/linkedin/camus), it will
Hi,
I have this doubt: assume that an RDD is stored across multiple nodes and
one of the nodes fails, so a partition is lost. Now, I know that when this
node is back, it uses the lineage from its neighbours and recomputes that
partition alone.
1) How does it get the source data (original data
I'm running on EC2 and I want to set the directory to use on the slaves
(mounted EBS volumes).
I have set:
spark.local.dir /vol3/my-spark-dir
in
/root/spark/conf/spark-defaults.conf
and replicated to all nodes. I have verified that in the console the value
in the config corresponds. I
I've been doing a bunch of work with CSVs in Spark, mostly saving them as a
merged CSV (instead of the various part-n files). You might find the
following links useful:
- This article is about combining the part files and outputting a header as
the first line in the merged results:
Hi Todd,
Thanks for the input.
I use IntelliJ as my IDE, and I created an SBT project. In build.sbt I list
all the dependencies, for example hive, spark-sql etc. These dependencies
stay in the local ivy2 repository after getting downloaded from Maven
Central. Should I go into ivy2 and
Thanks. The data is there; I have checked the row count and dumped it to a file.
-Vincent
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Thursday, February 05, 2015 2:28 PM
To: Sun, Vincent Y
Cc: user
Subject: Re: get null pointer exception newAPIHadoopRDD.map()
Is it possible that
You can do this manually without much trouble: get your files on a
distributed store like HDFS, read them with textFile, filter out
headers, parse with a CSV library like Commons CSV, select columns,
format and store the result. That's tens of lines of code.
However you probably want to start by
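A rough sketch of that pipeline, assuming Commons CSV is on the classpath (paths and column indices are placeholders, and the header filter assumes every file starts with the same header line):
```
import java.io.StringReader
import org.apache.commons.csv.CSVFormat

val lines = sc.textFile("hdfs:///data/in/*.csv")
val header = lines.first()
val rows = lines.filter(_ != header).map { line =>
  // parse one line; quoted embedded newlines would need whole-file parsing instead
  val record = CSVFormat.DEFAULT.parse(new StringReader(line)).iterator().next()
  (record.get(0), record.get(3)) // select the columns you need
}
rows.map { case (a, b) => s"$a,$b" }.saveAsTextFile("hdfs:///data/out")
```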
Thanks. I thought about it; yes, the DAG engine should have no issue building
the right graph in different threads (at least in theory it is not an issue).
So now I have another question: if I have a context initiated but there is no
operation on it for a very long time, will there be a timeout
Can you try setting SPARK_LOCAL_DIRS in spark-env.sh ?
Cheers
On Fri, Feb 6, 2015 at 7:30 AM, Joe Wass jw...@crossref.org wrote:
I'm running on EC2 and I want to set the directory to use on the slaves
(mounted EBS volumes).
I have set:
spark.local.dir /vol3/my-spark-dir
in
Good questions, some of which I'd like to know the answer to.
Is it okay to update a NoSQL DB with aggregated counts per batch
interval, or is it generally stored in HDFS?
This depends on how you are going to use the aggregate data.
1. Is there a lot of data? If so, and you are going to use
Hi Ashu,
Per the documentation:
Configuration of Hive is done by placing your hive-site.xml file in conf/.
For example, you can place something like this in your
$SPARK_HOME/conf/hive-site.xml file:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Ensure that the following statement
Did you restart the slaves so they would read the settings? You don't need
to start/stop the EC2 cluster, just the slaves. From the master node:
$SPARK_HOME/sbin/stop-slaves.sh
$SPARK_HOME/sbin/start-slaves.sh
($SPARK_HOME is probably /root/spark)
On Fri Feb 06 2015 at 10:31:18 AM Joe Wass
As Sean said, this is just a few lines of code. You can see an example here:
https://github.com/AyasdiOpenSource/bigdf/blob/master/src/main/scala/com/ayasdi/bigdf/DF.scala#L660
On Feb 6, 2015,
I'm running Spark 1.2 with YARN. My logs show that my executors are failing
to connect to my driver. This is because they are using the wrong hostname.
Since I'm running with YARN, I can't set spark.driver.host as explained in
SPARK-4253. So it should come from my HDFS configuration. Do you
Hi all,
this is my first email to this mailing list and I hope that I am not
doing anything wrong.
I am currently trying to define a distributed matrix with n rows and k
columns where each element is randomly sampled from a uniform distribution.
How can I do that?
It would also be nice if you
Hi,
I'm new to Spark and I'd like to install Spark with Scala. The aim is to build
a data processing system for door events.
The first step is to install Spark, Scala, HDFS and other required tools.
The second is to build the algorithm program in Scala which can process a file
of my data logs (events).
I think there are a number of misconceptions here. It is not necessary
that the original node come back in order to recreate the lost
partition. The lineage is not retrieved from neighboring nodes. The
source data is retrieved in the same way that it was the first time
that the partition was
OK. Is there no way to specify it in code, when I create the SparkConf?
From: Todd Nist tsind...@gmail.com
Sent: Friday, February 6, 2015 10:08 PM
To: Ashutosh Trivedi (MT2013030)
Cc: user@spark.apache.org
Subject: Re: Link existing Hive to Spark
You can always just