Re: Shuffle read/write issue in spark 1.2

2015-02-06 Thread Praveen Garg
We tried changing the compression codec from snappy to lz4. It did improve the performance but we are still wondering why default options didn’t work as claimed. From: Raghavendra Pandey raghavendra.pan...@gmail.com Date: Friday, 6 February 2015 1:23 pm To:

generate a random matrix with uniform distribution

2015-02-06 Thread Donbeo
Hi, I would like to know how I can generate a random matrix where each element comes from a uniform distribution in [-1, 1]. In particular I would like the matrix to be a distributed row matrix with dimension n x p. Is this possible with MLlib? Should I use another library? -- View this message

Re: matrix of random variables with spark.

2015-02-06 Thread Burak Yavuz
Forgot to add the more recent training material: https://databricks-training.s3.amazonaws.com/index.html On Fri, Feb 6, 2015 at 12:12 PM, Burak Yavuz brk...@gmail.com wrote: Hi Luca, You can tackle this using RowMatrix (spark-shell example): ``` import

Re: matrix of random variables with spark.

2015-02-06 Thread Burak Yavuz
Hi Luca, You can tackle this using RowMatrix (spark-shell example): ``` import org.apache.spark.mllib.linalg.distributed.RowMatrix import org.apache.spark.mllib.random._ // sc is the spark context, numPartitions is the number of partitions you want the RDD to be in val data: RDD[Vector] =
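A minimal sketch of how the truncated spark-shell example above likely continues, assuming the goal from the original question (a distributed n x p matrix with entries uniform in [-1, 1]); uniformVectorRDD samples in [0, 1), so the entries are rescaled, and n, p and numPartitions are assumed to be defined:

```
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random.RandomRDDs

// n rows, p columns, entries uniform in [0, 1), rescaled to [-1, 1)
val data = RandomRDDs.uniformVectorRDD(sc, n, p, numPartitions)
  .map(v => Vectors.dense(v.toArray.map(x => 2 * x - 1)))
val mat = new RowMatrix(data, n, p)
```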

Re: Get filename in Spark Streaming

2015-02-06 Thread Subacini B
Thank you Emre, this helps, I am able to get the filename. But I am not sure how to fit this into the DStream RDD. val inputStream = ssc.textFileStream(/hdfs Path/) inputStream is a DStream of RDDs and in foreachRDD I am doing my processing inputStream.foreachRDD(rdd = { // how to get filename here??

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-06 Thread Jon Gregg
OK I tried that, but how do I convert an RDD to a Set that I can then broadcast and cache? val badIPs = sc.textFile(hdfs:///user/jon/+ badfullIPs.csv) val badIPsLines = badIPs.getLines val badIpSet = badIPsLines.toSet val badIPsBC = sc.broadcast(badIpSet) produces the

MLLib: feature standardization

2015-02-06 Thread SK
Hi, I have a dataset in csv format and I am trying to standardize the features before using k-means clustering. The data does not have any labels but has the following format: s1, f12,f13,... s2, f21,f22,... where s is a string id, and f is a floating point feature value. To perform feature
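For reference, a minimal sketch (not from the thread) of standardizing unlabeled features with MLlib's StandardScaler before k-means, assuming a hypothetical input path and that the string id in the first column is simply dropped:

```
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

// hypothetical path; first field is the string id, the remaining fields are features
val features = sc.textFile("hdfs:///data/features.csv")
  .map(_.split(","))
  .map(fields => Vectors.dense(fields.drop(1).map(_.trim.toDouble)))

val scaler = new StandardScaler(withMean = true, withStd = true).fit(features)
val scaled = scaler.transform(features)   // RDD[Vector], ready for KMeans.train
```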

Re: Where can I find logs set inside RDD processing functions?

2015-02-06 Thread Nitin kak
The yarn log aggregation is enabled and the logs which I get through yarn logs -applicationId your_application_id are no different than what I get through logs in the Yarn Application tracking URL. They still don't have the above logs. On Fri, Feb 6, 2015 at 3:36 PM, Petar Zecevic

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-06 Thread Sandy Ryza
You can call collect() to pull in the contents of an RDD into the driver: val badIPsLines = badIPs.collect() On Fri, Feb 6, 2015 at 12:19 PM, Jon Gregg jonrgr...@gmail.com wrote: OK I tried that, but how do I convert an RDD to a Set that I can then broadcast and cache? val badIPs =
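Putting the thread together, a rough sketch of the collect-then-broadcast approach (the path is taken from the original post and assumed to be a literal here):

```
// read the small IP list, pull it to the driver, and broadcast it as a Set
val badIPsLines = sc.textFile("hdfs:///user/jon/badfullIPs.csv").collect()
val badIpSet = badIPsLines.toSet
val badIPsBC = sc.broadcast(badIpSet)

// executors then read badIPsBC.value, e.g.
// someRdd.filter(line => !badIPsBC.value.contains(line.split(",")(0)))
```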

Re: Where can I find logs set inside RDD processing functions?

2015-02-06 Thread Ted Yu
To add to what Petar said, when YARN log aggregation is enabled, consider specifying yarn.nodemanager.remote-app-log-dir which is where aggregated logs are saved. Cheers On Fri, Feb 6, 2015 at 12:36 PM, Petar Zecevic petar.zece...@gmail.com wrote: You can enable YARN log aggregation

RE: Connecting Cassandra by unknown host

2015-02-06 Thread Sun, Vincent Y
Thanks for the information. I have no issue connecting to my local Cassandra server; however, I still have an issue connecting to my company dev server. What needs to be done to resolve this issue? Thanks so much. -Vincent From: Ankur Srivastava [mailto:ankur.srivast...@gmail.com] Sent: Thursday,

Re: Shuffle read/write issue in spark 1.2

2015-02-06 Thread Aaron Davidson
Did the problem go away when you switched to lz4? There was a change in the default compression codec from 1.0 to 1.1, where we went from LZF to Snappy. I don't think there was any such change from 1.1 to 1.2, though. On Fri, Feb 6, 2015 at 12:17 AM, Praveen Garg praveen.g...@guavus.com wrote:

Where can I find logs set inside RDD processing functions?

2015-02-06 Thread nitinkak001
I am trying to debug my mapPartitions function. Here is the code. I am trying to log in two ways: using log.info() or println(). I am running in yarn-cluster mode. While I can see the logs from the driver code, I am not able to see logs from the map and mapPartitions functions in the Application Tracking

Re: spark driver behind firewall

2015-02-06 Thread Chip Senkbeil
Hi, You can use the Spark Kernel project (https://github.com/ibm-et/spark-kernel) as a workaround of sorts. The Spark Kernel provides a generic solution to dynamically interact with an Apache Spark cluster (think of a remote Spark Shell). It serves as the driver application with which you can

Re: Where can I find logs set inside RDD processing functions?

2015-02-06 Thread Petar Zecevic
You can enable YARN log aggregation (yarn.log-aggregation-enable to true) and execute command yarn logs -applicationId your_application_id after your application finishes. Or you can look at them directly in HDFS in /tmp/logs/user/logs/applicationid/hostname On 6.2.2015. 19:50, nitinkak001

Re: Beginner in Spark

2015-02-06 Thread Matei Zaharia
You don't need HDFS or virtual machines to run Spark. You can just download it, unzip it and run it on your laptop. See http://spark.apache.org/docs/latest/index.html. Matei On Feb 6, 2015, at 2:58 PM, David Fallside falls...@us.ibm.com wrote:

Spark SQL group by

2015-02-06 Thread Mohnish Kodnani
Hi, I am trying to issue a sql query against a parquet file and am getting errors and would like some help to figure out what is going on. The sql: select timestamp, count(rid), qi.clientname from records where timestamp 0 group by qi.clientname I am getting the following error:

Re: Spark SQL group by

2015-02-06 Thread Michael Armbrust
You can't use columns (timestamp) that aren't in the GROUP BY clause. Spark 1.2+ gives you a better error message for this case. On Fri, Feb 6, 2015 at 3:12 PM, Mohnish Kodnani mohnish.kodn...@gmail.com wrote: Hi, i am trying to issue a sql query against a parquet file and am getting errors
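To make the point concrete, a hedged sketch of the two usual fixes using the table and column names from the original query (the comparison operator in the WHERE clause was stripped by the archive; > 0 is assumed here): either aggregate the non-grouped column or add it to the GROUP BY:

```
// option 1: aggregate the column that is not grouped
sqlContext.sql("""
  SELECT qi.clientname, COUNT(rid), MIN(timestamp)
  FROM records
  WHERE timestamp > 0
  GROUP BY qi.clientname
""")

// option 2: group by it as well
sqlContext.sql("""
  SELECT timestamp, qi.clientname, COUNT(rid)
  FROM records
  WHERE timestamp > 0
  GROUP BY timestamp, qi.clientname
""")
```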

Problem when using spark.kryo.registrationRequired=true

2015-02-06 Thread Zalzberg, Idan (Agoda)
Hi, I am trying to restrict my serialized classes, as I am having weird issues with regards to serialization. However, my efforts hit a brick wall when I got the exception: Caused by: java.lang.IllegalArgumentException: Class is not registered: scala.reflect.ClassTag$$anon$1 Note: To register
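For what it's worth, the usual workaround is to register the anonymous class by name, since it cannot be referenced with classOf; a sketch, assuming Spark 1.2's SparkConf.registerKryoClasses:

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  // the anonymous ClassTag class from the exception has to be looked up by name
  .registerKryoClasses(Array(Class.forName("scala.reflect.ClassTag$$anon$1")))
```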

Re: Where can I find logs set inside RDD processing functions?

2015-02-06 Thread Nitin kak
yarn.nodemanager.remote-app-log-dir is set to /tmp/logs On Fri, Feb 6, 2015 at 4:14 PM, Ted Yu yuzhih...@gmail.com wrote: To add to What Petar said, when YARN log aggregation is enabled, consider specifying yarn.nodemanager.remote-app-log-dir which is where aggregated logs are saved.

Re: Beginner in Spark

2015-02-06 Thread David Fallside
King, consider trying the Spark Kernel ( https://github.com/ibm-et/spark-kernel) which will install Spark etc and provide you with a Spark/Scala Notebook in which you can develop your algorithm. The Vagrant installation described in

Re: Spark SQL group by

2015-02-06 Thread Mohnish Kodnani
Doh :) Thanks.. seems like brain freeze. On Fri, Feb 6, 2015 at 3:22 PM, Michael Armbrust mich...@databricks.com wrote: You can't use columns (timestamp) that aren't in the GROUP BY clause. Spark 1.2+ give you a better error message for this case. On Fri, Feb 6, 2015 at 3:12 PM, Mohnish

SQL group by on Parquet table slower when table cached

2015-02-06 Thread Manoj Samel
Spark 1.2. Data stored in a parquet table (large number of rows). Test 1: select a, sum(b), sum(c) from table. Test 2: sqlContext.cacheTable(); select a, sum(b), sum(c) from table - seeds the cache, first time slow since it is loading the cache?; select a, sum(b), sum(c) from table - second time it should be faster

Re: NaiveBayes classifier causes ShuffleDependency class cast exception

2015-02-06 Thread VISHNU SUBRAMANIAN
Can you try creating just a single spark context and then try your code. If you want to use it for streaming, pass the same SparkContext object instead of the conf. Note: instead of just replying to me, try to use reply-all so that the post is visible to the community. That way you can expect
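A minimal sketch of what that looks like, assuming an existing SparkContext named sc:

```
import org.apache.spark.streaming.{Seconds, StreamingContext}

// reuse the existing SparkContext instead of constructing a second one from a conf
val ssc = new StreamingContext(sc, Seconds(10))
// ... define the DStreams and the NaiveBayes scoring here ...
ssc.start()
ssc.awaitTermination()
```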

WebUI on yarn through ssh tunnel affected by ami filtered

2015-02-06 Thread Qichi Yang
Hi folks, I am new to spark. I just got spark 1.2 to run on emr ami 3.3.1 (hadoop 2.4). I ssh to the emr master node and submit the job or start the shell. Everything runs well except the webUI. In order to see the UI, I used an ssh tunnel which forwards my dev machine port to the emr master node webUI

WebUI on yarn through ssh tunnel affected by AmIpfilter

2015-02-06 Thread yangqch
Hi folks, I am new to spark. I just got spark 1.2 to run on emr ami 3.3.1 (hadoop 2.4). I ssh to the emr master node and submit the job or start the shell. Everything runs well except the webUI. In order to see the UI, I used an ssh tunnel which forwards my dev machine port to the emr master node webUI

Re: Shuffle read/write issue in spark 1.2

2015-02-06 Thread Praveen Garg
Yes. It improved the performance, but not only with spark 1.2; with spark 1.1 as well. Precisely, the job took more time to run in spark 1.2 with default options, but completed in almost the same time as spark 1.1 when both were run with “lz4”. From: Aaron Davidson

naive bayes text classifier with tf-idf in pyspark

2015-02-06 Thread Imran Akbar
Hi, I've got the following code http://pastebin.com/3kexKwg6 that's almost complete, but I have 2 questions: 1) Once I've computed the TF-IDF vector, how do I compute the vector for each string to feed into the LabeledPoint? 2) Does MLLib provide any methods to evaluate the model's precision,
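The question is about PySpark, but the MLlib flow is the same in either language; a hedged sketch in Scala of both steps (building LabeledPoints from TF-IDF vectors, and getting precision via MulticlassMetrics), assuming docs is an RDD of (label, tokens) pairs:

```
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// docs: RDD[(Double, Seq[String])] -- assumed input of (label, tokenized text)
val tf = new HashingTF()
val tfVectors = docs.map { case (_, tokens) => tf.transform(tokens) }
val idfModel = new IDF().fit(tfVectors)
val tfidf = idfModel.transform(tfVectors)

// zip is safe here because tfidf is derived from docs by map-only transformations
val training = docs.map(_._1).zip(tfidf).map { case (label, v) => LabeledPoint(label, v) }
val model = NaiveBayes.train(training)

// precision/recall via MulticlassMetrics (here, naively, on the training set)
val predictionsAndLabels = training.map(p => (model.predict(p.features), p.label))
println(new MulticlassMetrics(predictionsAndLabels).precision)
```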

Re: SQL group by on Parquet table slower when table cached

2015-02-06 Thread Michael Armbrust
Check the storage tab. Does the table actually fit in memory? Otherwise you are rebuilding column buffers in addition to reading the data off of the disk. On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel manojsamelt...@gmail.com wrote: Spark 1.2 Data stored in parquet table (large number of rows)
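A short sketch of the caching test the reply refers to; the point is to check the Storage tab after the first cached query to see whether the in-memory columnar table actually fits (the query shape and GROUP BY column are assumed from the original post):

```
sqlContext.cacheTable("table")
// first query materializes the column buffers, so it is expected to be slow
sqlContext.sql("select a, sum(b), sum(c) from table group by a").collect()
// now check the web UI Storage tab: if the table is only partially cached,
// later runs keep rebuilding buffers and re-reading Parquet from disk
sqlContext.sql("select a, sum(b), sum(c) from table group by a").collect()
```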

Re: generate a random matrix with uniform distribution

2015-02-06 Thread Burak Yavuz
Hi, You can do the following: ``` import org.apache.spark.mllib.linalg.distributed.RowMatrix import org.apache.spark.mllib.random._ // sc is the spark context, numPartitions is the number of partitions you want the RDD to be in val dist: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, n, k,

RE: Spark Metrics Servlet for driver and executor

2015-02-06 Thread Shao, Saisai
Hi Judy, For driver, it is /metrics/json, there's no metricsServlet for executor. Thanks Jerry From: Judy Nash [mailto:judyn...@exchange.microsoft.com] Sent: Friday, February 6, 2015 3:47 PM To: user@spark.apache.org Subject: Spark Metrics Servlet for driver and executor Hi all, Looking at

PR Request

2015-02-06 Thread Deep Pradhan
Hi, When we submit a PR in Github, there are various tests that are performed like RAT test, Scala Style Test, and beyond this many other tests which run for more time. Could anyone please direct me to the details of the tests that are performed there? Thank You

spark 1.2 writing on parquet after a join never ends - GC problems

2015-02-06 Thread Paolo Platter
Hi all, I’m experiencing a strange behaviour of spark 1.2. I have a 3 node cluster plus the master. Each node has: 1 HDD (7200 rpm, 1 TB), 16 GB RAM, 8 cores. I configured executors with 6 cores and 10 GB each (spark.storage.memoryFraction = 0.6). My job is pretty simple: val file1 =

pyspark importing custom module

2015-02-06 Thread Antony Mayi
Hi, is there a way to use a custom python module that is available to all executors under PYTHONPATH (without the need to upload it using sc.addPyFile())? It's a bit weird that this module is on all nodes yet the spark tasks can't use it (references to its objects are serialized and sent to all executors

Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Sandy Ryza
That's definitely surprising to me that you would be hitting a lot of GC for this scenario. Are you setting --executor-cores and --executor-memory? What are you setting them to? -Sandy On Thu, Feb 5, 2015 at 10:17 AM, Guillermo Ortiz konstt2...@gmail.com wrote: Any idea why if I use more

Link existing Hive to Spark

2015-02-06 Thread ashu
Hi, I have Hive in development and I want to use it in Spark. The Spark SQL documentation says the following: Users who do not have an existing Hive deployment can still create a HiveContext. When not configured by the hive-site.xml, the context automatically creates metastore_db and warehouse in the

Re: PR Request

2015-02-06 Thread Sean Owen
Have a look at the dev/run-tests script. On Fri, Feb 6, 2015 at 2:58 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, When we submit a PR in Github, there are various tests that are performed like RAT test, Scala Style Test, and beyond this many other tests which run for more time.

Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Guillermo Ortiz
This is an execution with 80 executors:
Metric | Min | 25th percentile | Median | 75th percentile | Max
Duration | 31 s | 44 s | 50 s | 1.1 min | 2.6 min
GC Time | 70 ms | 0.1 s | 0.3 s | 4 s | 53 s
Input | 128.0 MB | 128.0 MB | 128.0 MB | 128.0 MB | 128.0 MB
I executed as well with 40 executors (Metric | Min | 25th percentile | Median | 75th percentile | Max): Duration

how to process a file in spark standalone cluster without distributed storage (i.e. HDFS/EC2)?

2015-02-06 Thread Henry Hung
Hi All, sc.textFile will not work because the file is not distributed to the other workers, so I tried to read the file first using FileUtils.readLines and then use sc.parallelize, but readLines failed with an OOM (the file is large). Is there a way to split local files and upload those partitions to

RE: how to process a file in spark standalone cluster without distributed storage (i.e. HDFS/EC2)?

2015-02-06 Thread Henry Hung
Hi All, I already found a solution to solve this problem. Please ignore my question... Thanx Best regards, Henry From: MA33 YTHung1 Sent: Friday, February 6, 2015 4:34 PM To: user@spark.apache.org Subject: how to process a file in spark standalone cluster without distributed storage (i.e.

spark streaming from kafka real time + batch processing in java

2015-02-06 Thread Mohit Durgapal
I want to write a spark streaming consumer for kafka in java. I want to process the data in real-time as well as store the data in hdfs in year/month/day/hour/ format. I am not sure how to achieve this. Should I write separate kafka consumers, one for writing data to HDFS and one for spark
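One common pattern is a single streaming job that both processes each batch and archives the raw records to HDFS; a rough sketch in Scala (the Java API is analogous), with hypothetical ZooKeeper host, group id, topic, and output path:

```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-consumer")
val ssc = new StreamingContext(conf, Seconds(60))

// (topic -> number of receiver threads); the stream is of (key, message) pairs
val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("events" -> 1))
val lines = stream.map(_._2)

lines.foreachRDD { rdd =>
  // real-time processing of each batch goes here
}
// archive the raw data; this writes one directory per batch interval, which a
// separate batch job could compact into a year/month/day/hour layout
lines.saveAsTextFiles("hdfs:///data/events/raw")

ssc.start()
ssc.awaitTermination()
```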

Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Guillermo Ortiz
Yes, it's surprising to me as well. I tried to execute it with different configurations: sudo -u hdfs spark-submit --master yarn-client --class com.mycompany.app.App --num-executors 40 --executor-memory 4g Example-1.0-SNAPSHOT.jar hdfs://ip:8020/tmp/sparkTest/ file22.bin parameters This is

Parsing CSV files in Spark

2015-02-06 Thread Spico Florin
Hi! I'm new to Spark. I have a case study where the data is stored in CSV files. These files have headers with more than 1000 columns. I would like to know the best practices for parsing them, in particular the following points: 1. Getting and parsing all the files from a folder 2.

Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Sandy Ryza
Yes, having many more cores than disks and all writing at the same time can definitely cause performance issues. Though that wouldn't explain the high GC. What percent of task time does the web UI report that tasks are spending in GC? On Fri, Feb 6, 2015 at 12:56 AM, Guillermo Ortiz

Re: one is the default value for intercepts in GeneralizedLinearAlgorithm

2015-02-06 Thread Tamas Jambor
Thanks for the reply. Seems it is all set to zero in the latest code - I was checking 1.2 last night. On Fri Feb 06 2015 at 07:21:35 Sean Owen so...@cloudera.com wrote: It looks like the initial intercept term is 1 only in the addIntercept numOfLinearPredictor == 1 case. It does seem

Re: checking

2015-02-06 Thread Arush Kharbanda
Yes they are. On Fri, Feb 6, 2015 at 5:06 PM, Mohit Durgapal durgapalmo...@gmail.com wrote: Just wanted to know If my emails are reaching the user list. Regards Mohit -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead

Reg GraphX APSP

2015-02-06 Thread Deep Pradhan
Hi, Is the implementation of All Pairs Shortest Path on GraphX for directed graphs or undirected graphs? When I use the algorithm with a dataset, it assumes that the graph is undirected. Has anyone come across this earlier? Thank you

Re: spark streaming from kafka real time + batch processing in java

2015-02-06 Thread Andrew Psaltis
Mohit, I want to process the data in real-time as well as store the data in hdfs in year/month/day/hour/ format. Are you wanting to process it and then put it into HDFS, or just put the raw data into HDFS? If the latter, then why not just use Camus ( https://github.com/linkedin/camus), it will

Question about recomputing lost partition of rdd ?

2015-02-06 Thread Kartheek.R
Hi, I have this doubt: Assume that an rdd is stored across multiple nodes and one of the nodes fails. So, a partition is lost. Now, I know that when this node is back, it uses the lineage from its neighbours and recomputes that partition alone. 1) How does it get the source data (original data

How do I set spark.local.dirs?

2015-02-06 Thread Joe Wass
I'm running on EC2 and I want to set the directory to use on the slaves (mounted EBS volumes). I have set: spark.local.dir /vol3/my-spark-dir in /root/spark/conf/spark-defaults.conf and replicated to all nodes. I have verified that in the console the value in the config corresponds. I

Re: Parsing CSV files in Spark

2015-02-06 Thread Charles Feduke
I've been doing a bunch of work with CSVs in Spark, mostly saving them as a merged CSV (instead of the various part-n files). You might find the following links useful: - This article is about combining the part files and outputting a header as the first line in the merged results:

Re: Link existing Hive to Spark

2015-02-06 Thread Ashutosh Trivedi (MT2013030)
Hi Todd, Thanks for the input. I use IntelliJ as my IDE and I create an SBT project, and in build.sbt I declare all the dependencies, for example hive, spark-sql etc. These dependencies stay in the local ivy2 repository after getting downloaded from Maven Central. Should I go into ivy2 and

RE: get null pointer exception newAPIHadoopRDD.map()

2015-02-06 Thread Sun, Vincent Y
Thanks. The data is there, I have checked the row count and dumped it to a file. -Vincent From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Thursday, February 05, 2015 2:28 PM To: Sun, Vincent Y Cc: user Subject: Re: get null pointer exception newAPIHadoopRDD.map() Is it possible that

Re: Parsing CSV files in Spark

2015-02-06 Thread Sean Owen
You can do this manually without much trouble: get your files on a distributed store like HDFS, read them with textFile, filter out headers, parse with a CSV library like Commons CSV, select columns, format and store the result. That's tens of lines of code. However you probably want to start by
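A bare-bones sketch of those steps with a hypothetical path (a real CSV library such as Commons CSV is preferable to split once quoting and escaping matter):

```
val lines = sc.textFile("hdfs:///data/csv/*.csv")
val header = lines.first()
val rows = lines
  .filter(_ != header)                   // drop the header line(s)
  .map(_.split(",", -1))                 // naive parse; keeps trailing empty columns
  .map(cols => (cols(0), cols.drop(1)))  // e.g. keep an id column plus the rest
```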

RE: How to design a long live spark application

2015-02-06 Thread Shuai Zheng
Thanks. I think about it, yes, the DAG engine should not have an issue building the right graph in different threads (at least in theory, it is not an issue). So now I have another question: if I have a context initiated, but there is no operation on it for a very long time, will there be a timeout

Re: How do I set spark.local.dirs?

2015-02-06 Thread Ted Yu
Can you try setting SPARK_LOCAL_DIRS in spark-env.sh ? Cheers On Fri, Feb 6, 2015 at 7:30 AM, Joe Wass jw...@crossref.org wrote: I'm running on EC2 and I want to set the directory to use on the slaves (mounted EBS volumes). I have set: spark.local.dir /vol3/my-spark-dir in

Re: spark streaming from kafka real time + batch processing in java

2015-02-06 Thread Charles Feduke
Good questions, some of which I'd like to know the answer to. Is it okay to update a NoSQL DB with aggregated counts per batch interval or is it generally stored in hdfs? This depends on how you are going to use the aggregate data. 1. Is there a lot of data? If so, and you are going to use

Re: Link existing Hive to Spark

2015-02-06 Thread Todd Nist
Hi Ashu, Per the documents: Configuration of Hive is done by placing your hive-site.xml file in conf/. For example, you can place something like this in your $SPARK_HOME/conf/hive-site.xml file: <configuration> <property> <name>hive.metastore.uris</name> <!-- Ensure that the following statement

Re: How do I set spark.local.dirs?

2015-02-06 Thread Charles Feduke
Did you restart the slaves so they would read the settings? You don't need to start/stop the EC2 cluster, just the slaves. From the master node: $SPARK_HOME/sbin/stop-slaves.sh $SPARK_HOME/sbin/start-slaves.sh ($SPARK_HOME is probably /root/spark) On Fri Feb 06 2015 at 10:31:18 AM Joe Wass

Re: Parsing CSV files in Spark

2015-02-06 Thread Mohit Jaggi
As Sean said, this is just a few lines of code. You can see an example here: https://github.com/AyasdiOpenSource/bigdf/blob/master/src/main/scala/com/ayasdi/bigdf/DF.scala#L660 On Feb 6, 2015,

Spark Driver Host under Yarn

2015-02-06 Thread Al M
I'm running Spark 1.2 with Yarn. My logs show that my executors are failing to connect to my driver. This is because they are using the wrong hostname. Since I'm running with Yarn, I can't set spark.driver.host as explained in SPARK-4253. So it should come from my HDFS configuration. Do you

matrix of random variables with spark.

2015-02-06 Thread Luca Puggini
Hi all, this is my first email to this mailing list and I hope that I am not doing anything wrong. I am currently trying to define a distributed matrix with n rows and k columns where each element is randomly sampled from a uniform distribution. How can I do that? It would be also nice if you

Beginner in Spark

2015-02-06 Thread King sami
Hi, I'm new to Spark and I'd like to install Spark with Scala. The aim is to build a data processing system for door events. The first step is to install Spark, Scala, HDFS and the other required tools. The second is to build the algorithm program in Scala which can process a file of my data logs (events).

Re: Question about recomputing lost partition of rdd ?

2015-02-06 Thread Sean Owen
I think there are a number of misconceptions here. It is not necessary that the original node come back in order to recreate the lost partition. The lineage is not retrieved from neighboring nodes. The source data is retrieved in the same way that it was the first time that the partition was

Re: Link existing Hive to Spark

2015-02-06 Thread Ashutosh Trivedi (MT2013030)
OK. Is there no way to specify it in code, when I create the SparkConf? From: Todd Nist tsind...@gmail.com Sent: Friday, February 6, 2015 10:08 PM To: Ashutosh Trivedi (MT2013030) Cc: user@spark.apache.org Subject: Re: Link existing Hive to Spark You can always just
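It can also be set programmatically; a hedged sketch, using a hypothetical metastore host (normally this value comes from conf/hive-site.xml as described earlier in the thread):

```
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// hypothetical metastore URI; hive-site.xml on the classpath is the usual route
hiveContext.setConf("hive.metastore.uris", "thrift://metastore-host:9083")
hiveContext.sql("SHOW TABLES").collect().foreach(println)
```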