issue creating spark context with CDH 5.3.1

2015-03-09 Thread sachin Singh
Hi, I am using CDH 5.3.1 and I am getting the below error; even the Spark context is not getting created. I am submitting my job like this - submitting command: spark-submit --jars

Re: issue creating spark context with CDH 5.3.1

2015-03-09 Thread sachin Singh
I have copied hive-site.xml to the Spark conf folder: cp /etc/hive/conf/hive-site.xml /usr/lib/spark/conf

Re: issue creating spark context with CDH 5.3.1

2015-03-09 Thread Sean Owen
This one is CDH-specific and is already answered in the forums, so I'd go there instead. Ex: http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-sql-and-Hive-tables/td-p/22051 On Mon, Mar 9, 2015 at 12:33 PM, sachin Singh sachin.sha...@gmail.com wrote: Hi, I am using

Read Parquet file from scala directly

2015-03-09 Thread Shuai Zheng
Hi All, I have a lot of parquet files, and I try to open them directly instead of loading them into an RDD in the driver (so I can optimize some performance through special logic). But I have done some research online and can't find any example of accessing parquet directly from scala, has anyone done this

Re: Solve least square problem of the form min norm(A x - b)^2 + lambda * n * norm(x)^2 ?

2015-03-09 Thread Burak Yavuz
Hi Jaonary, The RowPartitionedMatrix is a special case of the BlockMatrix, where the colsPerBlock = nCols. I hope that helps. Burak On Mar 6, 2015 9:13 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi Shivaram, Thank you for the link. I'm trying to figure out how can I port this to mllib.

Re: what are the types of tasks when running ALS iterations

2015-03-09 Thread Burak Yavuz
+user On Mar 9, 2015 8:47 AM, Burak Yavuz brk...@gmail.com wrote: Hi, In the web UI, you don't see every single task. You see the name of the last task before the stage boundary (which is a shuffle like a groupByKey), which in your case is a flatMap. Therefore you only see flatMap in the UI.

Re: failure to display logs on YARN UI with log aggregation on

2015-03-09 Thread Ted Yu
See http://search-hadoop.com/m/JW1q5AneoE1 Cheers On Mon, Mar 9, 2015 at 7:29 AM, rok rokros...@gmail.com wrote: I'm using log aggregation on YARN with Spark and I am not able to see the logs through the YARN web UI after the application completes: Failed redirect for

How to preserve/preset partition information when load time series data?

2015-03-09 Thread Shuai Zheng
Hi All, If I have a set of time series data files, they are in parquet format and the data for each day are stored using a naming convention, but I will not know how many files there are for one day. 20150101a.parq 20150101b.parq 20150102a.parq 20150102b.parq 20150102c.parq . 201501010a.parq .

distcp problems on ec2 standalone spark cluster

2015-03-09 Thread roni
I got past the issues with the cluster not started problem by adding Yarn to mapreduce.framework.name . But when I try to distcp, if I use a URI with s3://path to my bucket .. I get invalid path even though the bucket exists. If I use s3n:// it just hangs. Did anyone else face anything like

saveAsTextFile extremely slow near finish

2015-03-09 Thread mingweili0x
I'm basically running a sort using Spark. The Spark program will read from HDFS, sort on composite keys, and then save the partitioned result back to HDFS. The pseudo code is like this: input = sc.textFile pairs = input.mapToPair sorted = pairs.sortByKey values = sorted.values
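The pseudo code maps onto the Scala API roughly as shown below; this is a hedged sketch, with the HDFS paths, the tab delimiter, and the choice of the first column as the composite key all assumed for illustration rather than taken from the post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-sort"))

    // Read lines, key each line by an assumed composite key (first column),
    // sort by that key, drop the keys, and write the partitioned result back.
    val input  = sc.textFile("hdfs:///data/input")              // assumed path
    val pairs  = input.map(line => (line.split('\t')(0), line))
    val sorted = pairs.sortByKey()
    val values = sorted.values

    values.saveAsTextFile("hdfs:///data/output")                // assumed path
    sc.stop()
  }
}
```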

java.lang.RuntimeException: Couldn't find function Some

2015-03-09 Thread Patcharee Thongtra
Hi, In my spark application I queried a hive table and tried to take only one record, but got java.lang.RuntimeException: Couldn't find function Some val rddCoOrd = sql(SELECT date, x, y FROM coordinate where order by date limit 1) val resultCoOrd = rddCoOrd.take(1)(0) Any ideas? I

GraphX Snapshot Partitioning

2015-03-09 Thread Matthew Bucci
Hello, I am working on a project where we want to split graphs of data into snapshots across partitions and I was wondering what would happen if one of the snapshots we had was too large to fit into a single partition. Would the snapshot be split over the two partitions equally, for example, and

Top, takeOrdered, sortByKey

2015-03-09 Thread Saba Sehrish
From: Saba Sehrish ssehr...@fnal.gov Date: March 9, 2015 at 4:11:07 PM CDT To: user-...@spark.apache.org Subject: Using top, takeOrdered, sortByKey I am using spark for a template matching problem. We have 77 million events in the

RE: sc.textFile() on windows cannot access UNC path

2015-03-09 Thread java8964
This is a Java problem, not really Spark. From this page: http://stackoverflow.com/questions/18520972/converting-java-file-url-to-file-path-platform-independent-including-u You can see that using java.nio.* on JDK 7 will fix this issue. But the Path class in Hadoop uses java.io.*, instead of

error on training with logistic regression sgd

2015-03-09 Thread Peng Xia
Hi, I was launching a spark cluster with 4 worker nodes, each worker node containing 8 cores and 56gb ram, and I was testing my logistic regression problem. The training set is around 1.2 million records. When I was using 2**10 (1024) features, the whole program works fine, but when I use 2**14

yarn + spark deployment issues (high memory consumption and task hung)

2015-03-09 Thread pranavkrs
Yarn + Spark: I am running my spark job (on yarn) on a 6 data node cluster of 512GB each. I was having a tough time configuring it since the job would hang in one or more tasks on any of the executors for an indefinite time. The stage can be as simple as an rdd count. And the bottleneck point is not always

sc.textFile() on windows cannot access UNC path

2015-03-09 Thread Wang, Ningjun (LNG-NPV)
I am running Spark on Windows 2008 R2. I use sc.textFile() to load a text file using a UNC path, but it does not work. sc.textFile(rawfile:10.196.119.230/folder1/abc.txt, 4).count() Input path does not exist: file:/10.196.119.230/folder1/abc.txt org.apache.hadoop.mapred.InvalidInputException:

Re: MLlib/kmeans newbie question(s)

2015-03-09 Thread Xiangrui Meng
You need to change `== 1` to `== i`. `println(t)` happens on the workers, which may not be what you want. Try the following: noSets.filter(t => model.predict(Utils.featurize(t)) == i).collect().foreach(println) -Xiangrui On Sat, Mar 7, 2015 at 3:20 PM, Pierce Lamb richard.pierce.l...@gmail.com
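As a minimal sketch of the suggested fix (not the poster's actual code): `featurize` stands in for the poster's own `Utils.featurize`, and `collect()` brings each cluster's members back to the driver so the `println` output is visible there rather than on the workers.

```scala
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Print the members of each cluster i = 0 .. k-1 on the driver.
def printClusters(docs: RDD[String], model: KMeansModel, featurize: String => Vector): Unit =
  (0 until model.k).foreach { i =>
    val members = docs.filter(t => model.predict(featurize(t)) == i).collect()
    println(s"cluster $i: ${members.length} documents")
    members.foreach(println)
  }
```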

Re: Spark Streaming input data source list

2015-03-09 Thread Tathagata Das
Spark Streaming has StreamingContext.socketStream() http://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/streaming/StreamingContext.html#socketStream(java.lang.String, int, scala.Function1, org.apache.spark.storage.StorageLevel, scala.reflect.ClassTag) TD On Mon, Mar 9, 2015 at 11:37 AM,
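For the common text case there is also socketTextStream; the sketch below is illustrative only, with the host, port, and batch interval all assumed. Note that the built-in socket receiver connects out to a host:port as a client, so a device that pushes data directly to Spark would instead need a custom receiver that listens.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketIngest {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("socket-ingest"), Seconds(10))

    // Connect to a TCP endpoint that serves newline-delimited text.
    val lines = ssc.socketTextStream("sensor-gateway.example.com", 9999)
    lines.count().print()   // print the number of records per batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```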

Re: Can't cache RDD of collaborative filtering on MLlib

2015-03-09 Thread Xiangrui Meng
cache() is lazy. The data is stored into memory after the first time it gets materialized. So the first time you call `predict` after you load the model back from HDFS, it still takes time to load the actual data. The second time will be much faster. Or you can call `userJavaRDD.count()` and
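A hedged sketch of that warm-up idea for an ALS model follows; how the model is rebuilt from HDFS is the poster's own code and is not shown here. The point is only that count() is an action, so it materializes the cached factor RDDs before the first predict call.

```scala
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Cache and force-materialize the factor RDDs of a reloaded ALS model so the
// first predict() call does not pay the cost of reading the data from HDFS.
def warmUp(model: MatrixFactorizationModel): Unit = {
  model.userFeatures.cache()
  model.productFeatures.cache()
  model.userFeatures.count()      // actions trigger the actual loading
  model.productFeatures.count()
}
```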

From Spark web ui, how to prove the parquet column pruning working

2015-03-09 Thread java8964
Hi, Currently most of the data in our production is stored using Avro + Snappy. I want to test the benefits if we store the data in Parquet format. I changed our ETL to generate Parquet format instead of Avro, and want to run a simple SQL query in Spark SQL to verify the benefits from Parquet. I

Joining data using Latitude, Longitude

2015-03-09 Thread Ankur Srivastava
Hi, I am trying to join data based on the latitude and longitude. I have reference data which has city information with their latitude and longitude. I have a data source with user information with their latitude and longitude. I want to find the nearest city to the user's latitude and

Spark Streaming input data source list

2015-03-09 Thread Cui Lin
Dear all, Could you send me a list of input data sources that spark streaming can support? My list is HDFS, Kafka, text file?… I am wondering if spark streaming could directly read data from a certain port (e.g. 443) that my devices directly send to? Best regards, Cui Lin

Re: Process time series RDD after sortByKey

2015-03-09 Thread Zhan Zhang
Does a code flow similar to the following work for you, which processes each partition of an RDD sequentially? while (iterPartition < RDD.partitions.length) { val res = sc.runJob(this, (it: Iterator[T]) => somFunc, iterPartition, allowLocal = true) Some other function after processing
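A hedged, self-contained version of that pattern might look like the following; `process` stands in for whatever per-partition logic is needed, and the runJob signature shown is the Spark 1.x one with the allowLocal flag.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Run a job on one partition at a time, in partition order, so that after
// sortByKey the data is processed sequentially by key range.
def processSequentially[T](sc: SparkContext, rdd: RDD[T])(process: Iterator[T] => Unit): Unit = {
  var p = 0
  while (p < rdd.partitions.length) {
    sc.runJob(rdd, (it: Iterator[T]) => process(it), Seq(p), allowLocal = true)
    p += 1
  }
}
```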

Process time series RDD after sortByKey

2015-03-09 Thread Shuai Zheng
Hi All, I am processing some time series data. One day might have 500GB, so each hour is around 20GB of data. I need to sort the data before I start processing. Assume I can sort it successfully with dayRDD.sortByKey, but after that, I might have thousands of partitions (to

sparse vector operations in Python

2015-03-09 Thread Daniel, Ronald (ELS-SDG)
Hi, Sorry to ask this, but how do I compute the sum of 2 (or more) mllib SparseVectors in Python? Thanks, Ron

Is there any problem in having a long opened connection to spark sql thrift server

2015-03-09 Thread fanooos
I have some applications developed using PHP, and currently we have a problem connecting these applications to the spark sql thrift server. (Here is the problem I am talking about: http://apache-spark-user-list.1001560.n3.nabble.com/Connection-PHP-application-to-Spark-Sql-thrift-server-td21925.html

Re: Optimizing SQL Query

2015-03-09 Thread anamika gupta
Please find the query plan: scala> sqlContext.sql(SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE FROM (SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS SDP_USAGE FROM (SELECT * FROM date_d AS dd JOIN interval_f AS intf ON intf.DATE_WID = dd.WID WHERE intf.DATE_WID =

Re: How to use the TF-IDF model?

2015-03-09 Thread Jeffrey Jedele
Hi, well, it really depends on what you want to do ;) TF-IDF is a measure that originates in the information retrieval context and that can be used to judge the relevancy of a document in context of a given search term. It's also often used for text-related machine learning tasks. E.g. have a
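A minimal sketch of the standard MLlib flow (Spark 1.2 API), assuming documents are already tokenized into word sequences: each document is hashed into a sparse term-frequency vector and IDF then rescales it, giving an RDD[Vector] that can feed a classifier or a similarity computation.

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Turn tokenized documents into TF-IDF weighted sparse vectors.
def tfIdf(docs: RDD[Seq[String]]): RDD[Vector] = {
  val tf: RDD[Vector] = new HashingTF().transform(docs)
  tf.cache()                      // IDF.fit and IDF.transform each pass over tf
  val idfModel = new IDF().fit(tf)
  idfModel.transform(tf)
}
```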

Top rows per group

2015-03-09 Thread Moss
I do have a schemaRDD where I want to group by a given field F1, but I want the result to be not a single row per group but multiple rows per group, where only the rows that have the top N F2 field values are kept. The issue is that the groupBy operation is an aggregation of multiple rows to a
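One hedged way to express this on a plain pair RDD (rather than on the SchemaRDD directly) is sketched below; the key is assumed to be F1 already extracted, `score` plays the role of F2, and note that groupByKey still pulls each whole group onto one node before the top N is taken.

```scala
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Keep only the n rows with the largest score within each group.
def topNPerGroup[R](pairs: RDD[(String, R)], n: Int)(score: R => Double): RDD[(String, Seq[R])] =
  pairs.groupByKey().mapValues(rows => rows.toSeq.sortBy(r => -score(r)).take(n))
```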

Spark History server default conf values

2015-03-09 Thread Srini Karri
Hi All, What are the default values for the following conf properties if we don't set them in the conf file? # spark.history.fs.updateInterval 10 # spark.history.retainedApplications 500 Regards, Srini.

Re: GraphX Snapshot Partitioning

2015-03-09 Thread Takeshi Yamamuro
Hi, Vertices are simply hash-partitioned by their 64-bit IDs, so they are evenly spread over partitions. As for edges, GraphLoader#edgeList builds edge partitions through hadoopFile(), so the initial partitions depend on InputFormat#getSplits implementations (e.g., partitions are mostly equal to

Re: Spark Streaming input data source list

2015-03-09 Thread Tathagata Das
Link to custom receiver guide https://spark.apache.org/docs/latest/streaming-custom-receivers.html On Mon, Mar 9, 2015 at 5:55 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi Lin, AFAIK, currently there’s no built-in receiver API for RDBMs, but you can customize your own receiver to get

RE: Spark Streaming input data source list

2015-03-09 Thread Shao, Saisai
Hi Lin, AFAIK, currently there's no built-in receiver API for RDBMs, but you can customize your own receiver to get data from RDBMs, for the details you can refer to the docs. Thanks Jerry From: Cui Lin [mailto:cui@hds.com] Sent: Tuesday, March 10, 2015 8:36 AM To: Tathagata Das Cc:
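A rough sketch of that custom-receiver approach is shown below (illustrative only; `fetchRows` stands in for whatever JDBC query the application actually runs): the receiver polls the database on a background thread and hands each row to Spark Streaming with store(), and it would be registered with ssc.receiverStream(new RdbmsReceiver(...)).

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class RdbmsReceiver(fetchRows: () => Seq[String])
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  override def onStart(): Unit = {
    // Poll the source on a separate thread; isStopped() flips when the
    // streaming context shuts this receiver down.
    new Thread("rdbms-receiver") {
      override def run(): Unit =
        while (!isStopped()) {
          fetchRows().foreach(store)   // push each row into the stream
          Thread.sleep(1000)           // poll interval, tune as needed
        }
    }.start()
  }

  override def onStop(): Unit = { /* nothing to clean up in this sketch */ }
}
```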

RE: sc.textFile() on windows cannot access UNC path

2015-03-09 Thread Wang, Ningjun (LNG-NPV)
Hi Yong, Thanks for the reply. Yes, it works with a local drive letter. But I really need to use a UNC path because the path is input at runtime. I cannot dynamically assign a drive letter to an arbitrary UNC path at runtime. Is there any workaround so that I can use a UNC path for sc.textFile(...)?

RE: A strange problem in spark sql join

2015-03-09 Thread Dai, Kevin
No, I don't have two master instances. From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: March 9, 2015 15:03 To: Dai, Kevin Cc: user@spark.apache.org Subject: Re: A strange problem in spark sql join Make sure you don't have two master instances running on the same machine. It could happen

How to use the TF-IDF model?

2015-03-09 Thread Xi Shen
Hi, I read this page, http://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. But I am wondering, how do I use this TF-IDF RDD? What does this TF-IDF vector look like? Can someone provide me some guidance? Thanks, Xi Shen about.me/davidshen

A strange problem in spark sql join

2015-03-09 Thread Dai, Kevin
Hi, guys I encountered a strange problem as follows: I joined two tables (which are both parquet files) and then did a groupby. The groupby took 19 hours to finish. However, when I killed this job twice in the groupby stage, the third try would succeed. But after I killed this job and ran it again, it

How to load my ML model?

2015-03-09 Thread Xi Shen
Hi, I used the method from this http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/train.html page to save my k-means model. But now, I have no idea how to load it back... I tried sc.objectFile("/path/to/data/file/directory/") But I got this error:
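Assuming a Spark version without built-in KMeansModel save/load, one hedged workaround (an assumption, not necessarily what the linked guide produces) is to persist only the cluster centers and rebuild the model on load; the path below is illustrative.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector

// Save just the cluster centers, then reconstruct the model when reading back.
def saveKMeans(sc: SparkContext, model: KMeansModel, path: String): Unit =
  sc.parallelize(model.clusterCenters, numSlices = 1).saveAsObjectFile(path)

def loadKMeans(sc: SparkContext, path: String): KMeansModel =
  new KMeansModel(sc.objectFile[Vector](path).collect())
```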

Re: A strange problem in spark sql join

2015-03-09 Thread Akhil Das
Make sure you don't have two master instances running on the same machine. It could happen like you were running the job and in the middle you tried to stop the cluster which didn't completely stopped it and you did a start-all again which will eventually end up having 2 master instances running,

Re: No executors allocated on yarn with latest master branch

2015-03-09 Thread Sandy Ryza
You would have needed to configure it by setting yarn.scheduler.capacity.resource-calculator to something ending in DominantResourceCalculator. If you haven't configured it, there's a high probability that the recently committed https://issues.apache.org/jira/browse/SPARK-6050 will fix your

what are the types of tasks when running ALS iterations

2015-03-09 Thread lisendong
You see, the core of ALS 1.0.0 is the following code: there should be flatMap and groupByKey tasks when running ALS iterations, right? But when I run ALS iterations, there are ONLY flatMap tasks... do you know why? private def updateFeatures( products: RDD[(Int,

Re: A way to share RDD directly using Tachyon?

2015-03-09 Thread Akhil Das
Did you try something like: myRDD.saveAsObjectFile("tachyon://localhost:19998/Y") val newRDD = sc.objectFile[MyObject]("tachyon://localhost:19998/Y") Thanks Best Regards On Sun, Mar 8, 2015 at 3:59 PM, Yijie Shen henry.yijies...@gmail.com wrote: Hi, I would like to share a RDD in several Spark

Ensuring data locality when opening files

2015-03-09 Thread Daniel Haviv
Hi, We wrote a spark streaming app that receives file names on HDFS from Kafka and opens them using Hadoop's libraries. The problem with this method is that I'm not utilizing data locality because any worker might open any file without giving precedence to data locality. I can't open the files

How to build Spark and run examples using Intellij ?

2015-03-09 Thread MEETHU MATHEW
Hi, I am trying to run examples of spark (master branch from git) from IntelliJ (14.0.2) but facing errors. These are the steps I followed: 1. git clone the master branch of apache spark. 2. Build it using mvn -DskipTests clean install. 3. In IntelliJ select Import Projects and choose the POM.xml