Count-based windows

2014-12-08 Thread Tobias Pfeiffer
Hi, I am interested in building an application that uses sliding windows not based on the time when the item was received, but on either * a timestamp embedded in the data, or * a count (like: every 10 items, look at the last 100 items). Also, I want to do this on stream data received from

How can I make Spark Streaming count the words in a file in a unit test?

2014-12-08 Thread Emre Sevinc
Hello, I've successfully built a very simple Spark Streaming application in Java that is based on the HdfsCount example in Scala at https://github.com/apache/spark/blob/branch-1.1/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala . When I submit this application to

Conference in Paris next Thursday

2014-12-08 Thread Alexis Seigneurin
Hi, I know there are not so many conferences on Spark in Paris, so I just wanted to let you know that Ippon will be holding one on Thursday next week (11th of December): http://blog.ippon.fr/2014/12/03/ippevent-spark-ou-comment-traiter-des-donnees-a-la-vitesse-de-leclair/ There will be 3

column level encryption/decryption with key management

2014-12-08 Thread Chirag Aggarwal
Hi, there have been some efforts in providing column-level encryption/decryption on Hive tables. https://issues.apache.org/jira/browse/HIVE-7934 Is there any plan to extend the functionality to Spark SQL as well? Thanks, Chirag

Re: Spark SQL: How to get the hierarchical element with SQL?

2014-12-08 Thread Raghavendra Pandey
Yeah, the dot notation works. It works even for arrays. But I am not sure if it can handle complex hierarchies. On Mon Dec 08 2014 at 11:55:19 AM Cheng Lian lian.cs@gmail.com wrote: You may access it via something like SELECT filterIp.element FROM tb, just like Hive. Or if you’re using
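
A minimal Scala sketch of the dot notation at work, assuming an existing SparkContext sc; the record contents and table name are illustrative, not taken from the thread:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // A JSON record with a nested struct, similar to the filterIp example above
    val json = sc.parallelize(Seq("""{"host": "h1", "filterIp": {"element": "10.0.0.1"}}"""))
    val tb = sqlContext.jsonRDD(json)
    tb.registerTempTable("tb")

    // Dot notation reaches into the nested struct, as in Hive
    sqlContext.sql("SELECT filterIp.element FROM tb").collect().foreach(println)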

Programmatically running spark jobs using yarn-client

2014-12-08 Thread Aniket Bhatnagar
I am trying to create (yet another) Spark-as-a-service tool that lets you submit jobs via REST APIs. I think I have nearly gotten it to work barring a few issues, some of which seem already fixed in 1.2.0 (like SPARK-2889), but I have hit a roadblock with the following issue. I have created a

Re: Efficient self-joins

2014-12-08 Thread Koert Kuipers
Spark can do efficient joins if both RDDs have the same partitioner, so in the case of a self-join I would recommend creating an RDD that has an explicit partitioner and has been cached. On Dec 8, 2014 8:52 AM, Theodore Vasiloudis theodoros.vasilou...@gmail.com wrote: Hello all, I am working on a
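
A minimal sketch of the suggested pattern, assuming a hypothetical pair RDD edges of (srcId, dstId); the partition count is arbitrary:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._   // pair-RDD implicits on older releases

    // Key the edges by destination vertex
    val edgesByDst = edges.map { case (src, dst) => (dst, src) }

    // Give the RDD an explicit partitioner and cache it, so the self-join
    // reuses the same partitioning instead of shuffling both sides again
    val partitioned = edgesByDst.partitionBy(new HashPartitioner(200)).cache()
    val incomingPairs = partitioned.join(partitioned)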

Error when mapping a schema RDD when converting lists

2014-12-08 Thread sahanbull
Hi Guys, I used applySchema to store a set of nested dictionaries and lists in a parquet file. http://apache-spark-user-list.1001560.n3.nabble.com/Using-sparkSQL-to-convert-a-collection-of-python-dictionary-of-dictionaries-to-schma-RDD-td20228.html#a20461 It was successful and I could

Locking for shared RDDs

2014-12-08 Thread aditya.athalye
I am relatively new to Spark. I am planning to use Spark Streaming for my OLAP use case, but I would like to know how RDDs are shared between multiple workers. If I need to constantly compute some stats on the streaming data, presumably the shared state would have to be updated serially by different

Re: Error when mapping a schema RDD when converting lists

2014-12-08 Thread sahanbull
As a temporary fix, it works when I convert field six to a list manually. That is: def generateRecords(line): # input : the row stored in parquet file # output : a python dictionary with all the key value pairs field1 = line.field1 summary = {}

Re: Efficient self-joins

2014-12-08 Thread Daniel Darabos
I do not see how you hope to generate all incoming edge pairs without repartitioning the data by dstID. You need to perform this shuffle for joining too. Otherwise two incoming edges could be in separate partitions and never meet. Am I missing something? On Mon, Dec 8, 2014 at 3:53 PM, Theodore

Re: spark Exception while performing saveAsTextFiles

2014-12-08 Thread Akhil Das
Check your worker logs for the exact reason; if you let the job run for 2 days then most likely you ran out of disk space or something similar. Looking at the worker logs will give you a clear picture. Thanks Best Regards On Mon, Dec 8, 2014 at 12:49 PM, Hafiz Mujadid

Re: monitoring for spark standalone

2014-12-08 Thread Akhil Das
You can set up and customize Nagios for all these requirements. Or you can use Ganglia if you are not looking for something with alerts (email etc.) Thanks Best Regards On Mon, Dec 8, 2014 at 1:05 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Hello, Are there ways we can

Re: Programmatically running spark jobs using yarn-client

2014-12-08 Thread Akhil Das
How are you submitting the job? You need to create a jar of your code (sbt package will give you one inside target/scala-*/projectname-*.jar) and then use it while submitting. If you are not using spark-submit then you can simply add this jar to spark by

Re: Spark SQL: How to get the hierarchical element with SQL?

2014-12-08 Thread Alessandro Panebianco
I went through complex hierarchical JSON structures and Spark seems to fail in querying them no matter what syntax is used. Hope this helps, Regards, Alessandro On Dec 8, 2014, at 6:05 AM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: Yeah, the dot notation works. It works even

Re: Efficient self-joins

2014-12-08 Thread Theodore Vasiloudis
@Daniel It's true that the first map in your code is needed, i.e. mapping so that dstID is the new RDD key. The self-join on the dstKey will then create all the pairs of incoming edges (plus self-referential and duplicates that need to be filtered out). @Koert Are there any guidelines about
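
A rough sketch of the pair generation being discussed, assuming numeric vertex ids and a hypothetical edges RDD of (srcId, dstId) pairs:

    import org.apache.spark.SparkContext._   // pair-RDD implicits on older releases

    // Key by destination; the self-join then pairs up all edges sharing a dstID
    val byDst = edges.map { case (src, dst) => (dst, src) }

    val incomingPairs = byDst.join(byDst)
      // drop self-referential pairs and keep each unordered pair only once
      .filter { case (_, (a, b)) => a < b }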

Re: Efficient self-joins

2014-12-08 Thread Daniel Darabos
On Mon, Dec 8, 2014 at 5:26 PM, Theodore Vasiloudis theodoros.vasilou...@gmail.com wrote: @Daniel It's true that the first map in your code is needed, i.e. mapping so that dstID is the new RDD key. You wrote groupByKey is highly inefficient due to the need to shuffle all the data, but you

Install Apache Spark on a Cluster

2014-12-08 Thread riginos
My thesis is related to big data mining and I have a cluster in the laboratory of my university. My task is to install Apache Spark on it and use it for extraction purposes. Is there any understandable guidance on how to do this? -- View this message in context:

Re: Install Apache Spark on a Cluster

2014-12-08 Thread Ritesh Kumar Singh
On a rough note: Step 1: Install Hadoop 2.x on all the machines in the cluster. Step 2: Check that the Hadoop cluster is working. Step 3: Set up Apache Spark as given on the documentation page for the cluster. Check the status of the cluster on the master UI. As it is a data mining project, configure Hive too.

Error: Spark-streaming to Cassandra

2014-12-08 Thread m.sarosh
Hi, I am intending to save the streaming data from Kafka into Cassandra, using Spark Streaming. But there seems to be a problem with the line javaFunctions(data).writerBuilder("testkeyspace", "test_table", mapToRow(TestTable.class)).saveToCassandra(); I am getting 2 errors. The code, the error log and

Re: Error when mapping a schema RDD when converting lists

2014-12-08 Thread Davies Liu
This is fixed in 1.2. Also, in 1.2+ you could call row.asDict() to convert the Row object into dict. On Mon, Dec 8, 2014 at 6:38 AM, sahanbull sa...@skimlinks.com wrote: Hi Guys, I used applySchema to store a set of nested dictionaries and lists in a parquet file.

Re: Efficient self-joins

2014-12-08 Thread Theodore Vasiloudis
@Daniel Not an expert either, I'm just going by what I see performance-wise currently. Our groupByKey implementation was more than an order of magnitude slower than using the self-join and then reduceByKey. FTA: *pairs on the same machine with the same key are combined (by using the lambda

Re: How can I make Spark Streaming count the words in a file in a unit test?

2014-12-08 Thread Burak Yavuz
Hi, https://github.com/databricks/spark-perf/tree/master/streaming-tests/src/main/scala/streaming/perf contains some performance tests for streaming. There are examples of how to generate synthetic files during the test in that repo, maybe you can find some code snippets that you can use there.

Spark-SQL JDBC driver

2014-12-08 Thread Anas Mosaad
Hello Everyone, I'm brand new to Spark and was wondering if there's a JDBC driver to access Spark SQL directly. I'm running Spark in standalone mode and don't have Hadoop in this environment. -- *Best Regards/أطيب المنى,* *Anas Mosaad*

RE: Spark-SQL JDBC driver

2014-12-08 Thread Judy Nash
You can use thrift server for this purpose then test it with beeline. See doc: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server From: Anas Mosaad [mailto:anas.mos...@incorta.com] Sent: Monday, December 8, 2014 11:01 AM To: user@spark.apache.org
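
For completeness, a hedged Scala sketch of querying the Thrift server over JDBC once it is running; the Hive JDBC driver on the classpath, the default port 10000, and the table name are all assumptions, not details from the thread:

    import java.sql.DriverManager

    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT count(*) FROM some_table")  // table name is illustrative
    while (rs.next()) println(rs.getLong(1))
    conn.close()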

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-08 Thread Michael Armbrust
This is by Hive's design. From the Hive documentation: The column change command will only modify Hive's metadata, and will not modify data. Users should make sure the actual data layout of the table/partition conforms with the metadata definition. On Sat, Dec 6, 2014 at 8:28 PM, Jianshi

SOLVED -- Re: scopt.OptionParser

2014-12-08 Thread Caron
Update: The issue in my previous post was solved: I had to change the sbt file name from project_name.sbt to build.sbt. - Thanks! -Caron -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/scopt-OptionParser-tp8436p20581.html Sent from the Apache Spark

Re: spark-submit on YARN is slow

2014-12-08 Thread Sandy Ryza
Hey Tobias, Can you try using the YARN Fair Scheduler and set yarn.scheduler.fair.continuous-scheduling-enabled to true? -Sandy On Sun, Dec 7, 2014 at 5:39 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, thanks for your responses! On Sat, Dec 6, 2014 at 4:22 AM, Sandy Ryza

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-08 Thread Yin Huai
Hello Jianshi, You meant you want to convert a Map to a Struct, right? We can extract some useful functions from JsonRDD.scala, so others can access them. Thanks, Yin On Mon, Dec 8, 2014 at 1:29 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: I checked the source code for inferSchema. Looks

Re: How can I create an RDD with millions of entries created programmatically

2014-12-08 Thread Daniel Darabos
Hi, I think you have the right idea. I would not even worry about flatMap. val rdd = sc.parallelize(1 to 100, numSlices = 1000).map(x => generateRandomObject(x)) Then when you try to evaluate something on this RDD, it will happen partition-by-partition. So 1000 random objects will be

Re: representing RDF literals as vertex properties

2014-12-08 Thread spr
OK, have waded into implementing this and have gotten pretty far, but am now hitting something I don't understand, a NoSuchMethodError. The code looks like [...] val conf = new SparkConf().setAppName(appName) //conf.set("fs.default.name", "file://") val sc = new

MLlib: Linear regression: Loss was due to java.lang.ArrayIndexOutOfBoundsException

2014-12-08 Thread Sameer Tilak
Hi All, I was able to run LinearRegressionWithSGD for a larger dataset (2GB, sparse). I have now filtered the data and I am running regression on a subset of it (~200 MB). I see this error, which is strange since it was running fine with the superset data. Is this a formatting issue

Re: Why KMeans with mllib is so slow ?

2014-12-08 Thread DB Tsai
You just need to use the latest master code without any configuration to get performance improvement from my PR. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Dec 8, 2014 at 7:53

Re: How can I create an RDD with millions of entries created programmatically

2014-12-08 Thread Steve Lewis
Looks good, but how do I say that in Java? As far as I can see, sc.parallelize (in Java) has only one implementation, which takes a List - requiring an in-memory representation. On Mon, Dec 8, 2014 at 12:06 PM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: Hi, I think you have the right

Re: Is there a way to get column names using hiveContext ?

2014-12-08 Thread Michael Armbrust
You can call .schema on SchemaRDDs. For example: results.schema.fields.map(_.name) On Sun, Dec 7, 2014 at 11:36 PM, abhishek reachabhishe...@gmail.com wrote: Hi, I have iplRDD which is a json, and I do below steps and query through hivecontext. I get the results but without columns
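
Expanded into a small sketch, assuming an existing SparkContext sc; the table name is illustrative:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    val results = hiveContext.sql("SELECT * FROM ipl_table")

    // The StructType schema carries the column names
    val columnNames: Seq[String] = results.schema.fields.map(_.name)
    columnNames.foreach(println)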

Transform SchemaRDDs into new SchemaRDDs

2014-12-08 Thread Sunita Arvind
Hi, I need to generate some flags based on certain columns and add them back to the SchemaRDD for further operations. Do I have to use a case class (reflection or programmatically)? I am using parquet files, so the schema is being automatically derived. This is a great feature, thanks to Spark

Re: Print Node info. of Decision Tree

2014-12-08 Thread Manish Amde
Hi Jake, The toString method should print the full model in versions 1.1.x. The current master branch has a method toDebugString for DecisionTreeModel which should print out all the node classes and the toString method has been updated to print the summary only so there is a slight change in the
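
A hedged sketch of printing a model both ways, assuming a LabeledPoint dataset loaded from a placeholder LibSVM file:

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val model = DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo,
      "gini", 5, 32)   // impurity, maxDepth, maxBins

    println(model.toString)        // summary only in newer versions
    println(model.toDebugString)   // full tree, node by node (master / 1.2+)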

Learning rate or stepsize automation

2014-12-08 Thread Bui, Tri
Hi, Is there any way to auto-calculate the optimal learning rate or step size via MLlib for SGD? Thanks, Tri

Re: representing RDF literals as vertex properties

2014-12-08 Thread Ankur Dave
At 2014-12-08 12:12:16 -0800, spr s...@yarcdata.com wrote: OK, have waded into implementing this and have gotten pretty far, but am now hitting something I don't understand, a NoSuchMethodError. [...] The (short) traceback looks like Exception in thread main java.lang.NoSuchMethodError:

Re: Learning rate or stepsize automation

2014-12-08 Thread Debasish Das
Hi Bui, Please use BFGS-based solvers. For BFGS you don't have to specify a step size, since the line search will find a sufficient decrease each time. For regularization you still have to do a grid search; it's not possible to automate that, but on master you will find nice ways to automate grid
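
As an example of a solver with no step size to tune, a minimal sketch using the L-BFGS-based logistic regression from MLlib; the data path is a placeholder, and the thread's regression problem would need the corresponding gradient/updater instead:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.util.MLUtils

    val training = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt").cache()

    // The line search inside L-BFGS chooses the step, so no stepSize parameter is set
    val model = new LogisticRegressionWithLBFGS().run(training)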

Error outputing to CSV file

2014-12-08 Thread DataNut
I am trying to create a CSV file. I have managed to create the actual string I want to output to a file, but when I try to write the file, I get the following error: saveAsTextFile is not a member of String. My string is perfect; when I call this line to actually save the file, I get the above
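
saveAsTextFile is defined on RDDs, not on String, so one hedged way out is to keep the rows in an RDD; rowsRdd, alreadyBuiltCsvString and the output path below are hypothetical names, not from the thread:

    // Build each CSV line inside the RDD instead of one big String on the driver
    val csvLines = rowsRdd.map { case (a, b, c, d) => s"$a,$b,$c,$d" }
    csvLines.saveAsTextFile("hdfs:///tmp/output_csv")

    // Or, if the String has already been built on the driver:
    sc.parallelize(Seq(alreadyBuiltCsvString)).saveAsTextFile("hdfs:///tmp/output_csv")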

Re: run JavaAPISuite with maven

2014-12-08 Thread Ted Yu
Running JavaAPISuite (master branch) on Linux, I got: testGuavaOptional(org.apache.spark.JavaAPISuite) Time elapsed: 32.945 sec ERROR! org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED at org.apache.spark.scheduler.DAGScheduler.org

Re: Efficient self-joins

2014-12-08 Thread Koert Kuipers
Yeah, Spark has very little overhead per partition, so generally more partitions is better. On Mon, Dec 8, 2014 at 1:46 PM, Theodore Vasiloudis theodoros.vasilou...@gmail.com wrote: @Daniel Not an expert either, I'm just going by what I see performance-wise currently. Our groupByKey

In Java how can I create an RDD with a large number of elements

2014-12-08 Thread Steve Lewis
Assume I don't care about values which may be created in a later map - in Scala I can say val rdd = sc.parallelize(1 to 10, numSlices = 1000) but in Java JavaSparkContext can only parallelize a List - limited to Integer.MAX_VALUE elements and required to exist in memory - the best I can

RE: Fair scheduling across applications in stand-alone mode

2014-12-08 Thread Mohammed Guller
Hi - Does anybody have any ideas on how to dynamically allocate cores instead of statically partitioning them among multiple applications? Thanks. Mohammed From: Mohammed Guller Sent: Friday, December 5, 2014 11:26 PM To: user@spark.apache.org Subject: Fair scheduling across applications in

Is there an efficient way to append new data to a registered Spark SQL Table?

2014-12-08 Thread Xuelin Cao
Hi, I'm wondering whether there is an efficient way to continuously append new data to a registered Spark SQL table. This is what I want: I want to make an ad-hoc query service for a JSON-formatted system log. Certainly, the system log is continuously generated. I will use
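
One possible pattern (a sketch, not necessarily the most efficient approach) is to union each new batch into the existing SchemaRDD and re-register the temp table; the paths and table name are placeholders:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    var logs = sqlContext.jsonFile("hdfs:///logs/day1")
    logs.registerTempTable("logs")

    // When a new chunk of the JSON log arrives, append it and refresh the table
    val newBatch = sqlContext.jsonFile("hdfs:///logs/day2")
    logs = logs.unionAll(newBatch)
    logs.registerTempTable("logs")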

Can not write out data as snappy-compressed files

2014-12-08 Thread Tao Xiao
I'm using CDH 5.1.0 and Spark 1.0.0, and I'd like to write out data as snappy-compressed files but encountered a problem. My code is as follows: val InputTextFilePath = hdfs://ec2.hadoop.com:8020/xt/text/new.txt val OutputTextFilePath = hdfs://ec2.hadoop.com:8020/xt/compressedText/ val
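
One commonly used way, sketched here under the assumption that the native Snappy libraries are available on the cluster (usually the case on CDH), is to pass the codec class directly to saveAsTextFile:

    import org.apache.hadoop.io.compress.SnappyCodec

    val input = sc.textFile("hdfs://ec2.hadoop.com:8020/xt/text/new.txt")
    input.saveAsTextFile("hdfs://ec2.hadoop.com:8020/xt/compressedText/", classOf[SnappyCodec])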

Re: In Java how can I create an RDD with a large number of elements

2014-12-08 Thread praveen seluka
Steve, Something like this will do I think: sc.parallelize(1 to 1000, 1000).flatMap(x => 1 to 100000). The above will launch 1000 tasks (maps), with each task creating 10^5 numbers (total of 100 million elements) On Mon, Dec 8, 2014 at 6:17 PM, Steve Lewis lordjoe2...@gmail.com wrote: assume

query classification using Apache Spark MLlib

2014-12-08 Thread Huang,Jin
I have a question, as the title says; the question link is http://stackoverflow.com/questions/27370170/query-classification-using-apache-spark-mlib . Thanks, Jin

How to convert RDD to JSON?

2014-12-08 Thread YaoPau
Pretty straightforward: Using Scala, I have an RDD that represents a table with four columns. What is the recommended way to convert the entire RDD to one JSON object? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-convert-RDD-to-JSON-tp20585.html

Re: spark-submit on YARN is slow

2014-12-08 Thread Tobias Pfeiffer
Hi, On Tue, Dec 9, 2014 at 4:39 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Can you try using the YARN Fair Scheduler and set yarn.scheduler.fair.continuous-scheduling-enabled to true? I'm using Cloudera 5.2.0 and my configuration says yarn.resourcemanager.scheduler.class =

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-08 Thread Jianshi Huang
Ah... I see. Thanks for pointing it out. Then it means we cannot mount an external table using customized column names. Hmm... Then the only option left is to use a subquery to add a bunch of column aliases. I'll try it later. Thanks, Jianshi On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust

Re: Locking for shared RDDs

2014-12-08 Thread Raghavendra Pandey
You don't need to worry about locks as such, as one thread/worker is responsible exclusively for one partition of the RDD. You can use the Accumulator variables that Spark provides to get the state updates. On Mon Dec 08 2014 at 8:14:28 PM aditya.athalye adbrihadarany...@gmail.com wrote: I am
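
A minimal accumulator sketch, assuming a hypothetical RDD of log lines called someRdd; tasks only add to the accumulator, and the driver reads the merged value:

    val errorCount = sc.accumulator(0L)

    someRdd.foreach { line =>
      if (line.contains("ERROR")) errorCount += 1L   // no locking needed inside tasks
    }
    println(s"errors seen: ${errorCount.value}")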

Saving Data only if DStream is not empty

2014-12-08 Thread Hafiz Mujadid
Hi Experts! I want to save a DStream to HDFS only if it is not empty, i.e. it contains some Kafka messages to be stored. What is an efficient way to do this? var data = KafkaUtils.createStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](ssc, params, topicMap,
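
A sketch of one common pattern, checking each batch with take(1) (rdd.isEmpty is not available in older releases); the output path is a placeholder and the raw byte-array messages would normally be decoded to something printable first:

    data.foreachRDD { (rdd, time) =>
      if (rdd.take(1).nonEmpty) {
        rdd.saveAsTextFile(s"hdfs:///kafka-dumps/batch-${time.milliseconds}")
      }
    }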

Writing and reading file faster than memory option

2014-12-08 Thread Malte
I am facing a somewhat confusing problem: My spark app reads data from a database, calculates certain values and then runs a shortest path Pregel operation on them. If I save the RDD to disk and then read the information out again, my app runs between 30-50% faster than keeping it in memory, plus

spark broadcast unavailable

2014-12-08 Thread 十六夜涙
Hi all, In my Spark application, I load a CSV file and map the data to a Map variable for later use on the driver node, then broadcast it. Everything works fine until the exception java.io.FileNotFoundException occurs. The console log information shows me the broadcast is unavailable. I googled this

Re: spark broadcast unavailable

2014-12-08 Thread Akhil Das
You cannot pass the sc object (val b = Utils.load(sc, ip_lib_path)) inside a map function, and that's why the Serialization exception is popping up (since sc is not serializable). You can try Tachyon's cache if you want to persist the data in memory kind of forever. Thanks Best Regards On Tue,
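
A hedged sketch of the broadcast approach: build the Map once on the driver, broadcast it, and reference only the broadcast handle inside closures shipped to executors (loadCsvToMap and lines are hypothetical names):

    // Driver side: load the CSV into a Map, then broadcast it
    val ipLib: Map[String, String] = loadCsvToMap("ip_lib.csv")
    val ipLibBc = sc.broadcast(ipLib)

    // Executor side: use ipLibBc.value, never sc, inside the map function
    val enriched = lines.map { line =>
      val ip = line.split(",")(0)
      (line, ipLibBc.value.getOrElse(ip, "unknown"))
    }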

Re: vcores used in cluster metrics(yarn resource manager ui) when running spark on yarn

2014-12-08 Thread Sandy Ryza
Hi yuemeng, Are you possibly running the Capacity Scheduler with the default resource calculator? -Sandy On Sat, Dec 6, 2014 at 7:29 PM, yuemeng1 yueme...@huawei.com wrote: Hi, all When I run an app with this cmd: ./bin/spark-sql --master yarn-client --num-executors 2

Re: Spark on YARN memory utilization

2014-12-08 Thread Sandy Ryza
Another thing to be aware of is that YARN will round up containers to the nearest increment of yarn.scheduler.minimum-allocation-mb, which defaults to 1024. -Sandy On Sat, Dec 6, 2014 at 3:48 PM, Denny Lee denny.g@gmail.com wrote: Got it - thanks! On Sat, Dec 6, 2014 at 14:56 Arun Ahuja

PhysicalRDD problem?

2014-12-08 Thread nitin
Hi All, I am facing the following problem on Spark 1.2 rc1, where I get a TreeNode exception (unresolved attributes): https://issues.apache.org/jira/browse/SPARK-2063 To avoid this, I do something like the following: val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD, existingSchemaRDD.schema)

Re: spark sql - save to Parquet file - Unsupported datatype TimestampType

2014-12-08 Thread ZHENG, Xu-dong
I meet the same issue. Any solution? On Wed, Nov 12, 2014 at 2:54 PM, tridib tridib.sama...@live.com wrote: Hi Friends, I am trying to save a json file to parquet. I got the error Unsupported datatype TimestampType. Does Parquet not support dates? Which parquet version does Spark use? Is there

Re: How to convert RDD to JSON?

2014-12-08 Thread lihu
An RDD is just a wrapper around a Scala collection. Maybe you can use the .collect() method to get the Scala collection, and then convert it to a JSON object using a Scala method.
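
A library-free sketch for a four-column RDD; table and the column names are hypothetical, and no quote escaping is handled:

    // Render each row as a JSON string, then save the lines or collect into one array
    val jsonLines = table.map { case (a, b, c, d) =>
      s"""{"col1": "$a", "col2": "$b", "col3": "$c", "col4": "$d"}"""
    }
    val jsonArray = "[" + jsonLines.collect().mkString(",") + "]"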

Issues on schemaRDD's function got stuck

2014-12-08 Thread LinQili
Hi all, I was running HiveFromSpark on yarn-cluster. When I got the Hive select's result SchemaRDD and tried to run `collect()` on it, the application got stuck and I don't know what's wrong with it. Here is my code: val sqlStat = s"SELECT * FROM $TABLE_NAME" val result = hiveContext.hql(sqlStat)