Implementing FIRST_VALUE, LEAD, LAG in Spark

2015-02-13 Thread Dmitry Tolpeko
Hello, to convert existing MapReduce jobs to Spark, I need to implement window functions such as FIRST_VALUE, LEAD, LAG and so on. For example, the FIRST_VALUE function: Source (1st column is the key): A, A1 A, A2 A, A3 B, B1 B, B2 C, C1 and the result should be A, A1, A1 A, A2, A1 A, A3, A1 B, B1,
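
A minimal sketch of one way to emulate FIRST_VALUE without window-function support (an assumption here is that the per-key ordering can be read off the value itself, as in the example where A1 < A2 < A3; `sc` is the usual shell SparkContext):

```scala
import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x

// Data matching the example in the question.
val rows = sc.parallelize(Seq(
  ("A", "A1"), ("A", "A2"), ("A", "A3"),
  ("B", "B1"), ("B", "B2"),
  ("C", "C1")))

// FIRST_VALUE per key: keep the smallest value for each key,
// then join it back onto every row of that key.
val firsts    = rows.reduceByKey((a, b) => if (a <= b) a else b)
val withFirst = rows.join(firsts).map { case (k, (v, first)) => (k, v, first) }

withFirst.collect().foreach(println)   // e.g. (A,A2,A1), (B,B1,B1), ...
```

LEAD and LAG need a per-key ordering index as well (for example via zipWithIndex) and a shifted self-join; the same pair-RDD building blocks apply.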

Interact with streams in a non-blocking way

2015-02-13 Thread jamborta
Hi all, I am trying to come up with a workflow where I can query streams asynchronously. The problem I have is that the ssc.awaitTermination() line blocks the whole thread, so it is not straightforward to me whether it is possible to get hold of objects from streams once they are started. Any suggestion

Re: Interact with streams in a non-blocking way

2015-02-13 Thread Sean Owen
You call awaitTermination() in the main thread, and indeed it blocks there forever. From there Spark Streaming takes over, invoking the operations you set up. Your operations have access to the data, of course. That's the model; you don't make external threads that reach into Spark

Re: Can spark job server be used to visualize streaming data?

2015-02-13 Thread Mayur Rustagi
Frankly, there is no good/standard way to visualize streaming data. So far I have found HBase to be a good intermediate store for data from streams, visualizing it with a Play-based framework and d3.js. Regards Mayur On Fri Feb 13 2015 at 4:22:58 PM Kevin (Sangwoo) Kim kevin...@apache.org wrote: I'm not very

SparkSQL and star schema

2015-02-13 Thread Paolo Platter
Hi, is SparkSQL + Parquet suitable to replicate a star schema ? Paolo Platter AgileLab CTO

Re: An interesting and serious problem I encountered

2015-02-13 Thread Sean Owen
A number of comments: 310GB is probably too large for an executor. You probably want many smaller executors per machine. But this is not your problem. You didn't say where the OutOfMemoryError occurred. Executor or driver? Tuple2 is a Scala type, and a general type. It is appropriate for

Spark Distributed Join

2015-02-13 Thread Ashish Mukherjee
Hello, I have the following scenario and was wondering if I can use Spark to address it. I want to query two different data stores (say, ElasticSearch and MySQL) and then merge the two result sets based on a join key between the two. Is it appropriate to use Spark to do this join, if the
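
Assuming both result sets have already been loaded into Spark as pair RDDs keyed on the join column (the loading itself would go through the respective Elasticsearch and JDBC connectors, not shown here), the merge is a plain pair-RDD join. A rough sketch with hypothetical record types `EsRecord` and `MysqlRecord`:

```scala
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Hypothetical row types; in practice these come from your connectors.
case class EsRecord(fields: Map[String, String])
case class MysqlRecord(fields: Map[String, String])

def mergeByKey(esRows: RDD[(String, EsRecord)],
               mysqlRows: RDD[(String, MysqlRecord)]): RDD[(String, Map[String, String])] = {
  // Inner join on the shared key; use leftOuterJoin / fullOuterJoin if unmatched rows matter.
  esRows.join(mysqlRows).map { case (key, (es, mysql)) =>
    (key, es.fields ++ mysql.fields)
  }
}
```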

Re: saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-13 Thread Sean Owen
It is a wrapper whose API is logically the same, but whose method signatures make more sense in Java. You can call the Scala API in Java without too much trouble, but it gets messy when you have to manually grapple with ClassTag from Java, for example. There is not an implicit conversion since it

Re: Can spark job server be used to visualize streaming data?

2015-02-13 Thread Kevin (Sangwoo) Kim
I'm not sure about CDH 5.3, but Zeppelin now works for Spark 1.2, as spark-repl has been published in Spark 1.2.1. Please try again! On Fri Feb 13 2015 at 3:55:19 PM Su She suhsheka...@gmail.com wrote: Thanks Kevin for the link, I have had issues trying to install zeppelin as I believe it is

Tuning number of partitions per CPU

2015-02-13 Thread Igor Petrov
Hello, In Spark programming guide (http://spark.apache.org/docs/1.2.0/programming-guide.html) there is a recommendation: Typically you want 2-4 partitions for each CPU in your cluster. We have a Spark Master and two Spark workers each with 18 cores and 18 GB of RAM. In our application we use

Re: Tuning number of partitions per CPU

2015-02-13 Thread Sean Owen
18 cores or 36? It probably doesn't matter. For this case where you have some overhead per partition of setting up the DB connection, it may indeed not help to chop up the data more finely than your total parallelism. Although that would imply quite an overhead. Are you doing any other expensive
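
If the per-partition cost really is the DB connection setup, one common pattern is to open the connection once per partition inside mapPartitions rather than once per record. A sketch only: `inputRdd` (an RDD[String]), `DbConnection`, `createConnection` and `lookup` are hypothetical stand-ins for the real client code.

```scala
// Placeholder DB client code for illustration.
class DbConnection { def close(): Unit = () }
def createConnection(): DbConnection = new DbConnection
def lookup(conn: DbConnection, record: String): String = record

val enriched = inputRdd.mapPartitions { iter =>
  val conn   = createConnection()                        // one connection per partition
  val result = iter.map(r => lookup(conn, r)).toList     // materialize before closing the connection
  conn.close()
  result.iterator
}
```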

Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-13 Thread Jianshi Huang
Got it. Thanks Reynold and Andrew! Jianshi On Thu, Feb 12, 2015 at 12:25 AM, Andrew Or and...@databricks.com wrote: Hi Jianshi, For YARN, there may be an issue with how a recent patch changes the accessibility of the shuffle files by the external shuffle service:

Re: Shuffle on joining two RDDs

2015-02-13 Thread Karlson
Does that mean partitioning does not work in Python? Or does this only affect joining? On 2015-02-12 19:27, Davies Liu wrote: The feature works as expected in Scala/Java, but is not implemented in Python. On Thu, Feb 12, 2015 at 9:24 AM, Imran Rashid iras...@cloudera.com wrote: I wonder if

Re: saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-13 Thread Vladimir Protsenko
Thank you for the clarification, Sean. 2015-02-13 14:16 GMT+04:00 Sean Owen so...@cloudera.com: It is a wrapper whose API is logically the same, but whose method signatures make more sense in Java. You can call the Scala API in Java without too much trouble, but it gets messy when you have to

Re: Interact with streams in a non-blocking way

2015-02-13 Thread Tamas Jambor
Thanks for the reply. I am trying to set up a streaming-as-a-service approach, using the framework that is used for spark-jobserver. For that I would need to handle asynchronous operations that are initiated from outside of the stream. Do you think it is not possible? On Fri Feb 13 2015 at

Re: Interact with streams in a non-blocking way

2015-02-13 Thread Sean Owen
Sure it's possible, but you would use Streaming to update some shared state, and create another service that accessed that shared state too. On Fri, Feb 13, 2015 at 11:57 AM, Tamas Jambor jambo...@gmail.com wrote: Thanks for the reply, I am trying to setup a streaming as a service approach,

Re: Shuffle on joining two RDDs

2015-02-13 Thread Karlson
In https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38, wouldn't it help to change the lines vs = rdd.map(lambda (k, v): (k, (1, v))) ws = other.map(lambda (k, v): (k, (2, v))) to vs = rdd.mapValues(lambda v: (1, v)) ws = other.mapValues(lambda v: (2, v))

how to get SchemaRDD SQL exceptions i.e. table not found exception

2015-02-13 Thread sachin Singh
Hi, can someone explain how to trap a SQL exception for a query executed using a SchemaRDD? I mean, for example, when a table is not found. Thanks in advance.
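
One way to trap the failure on the Scala side is to wrap the call in scala.util.Try. A minimal sketch, assuming a SQLContext named `sqlContext` and a table name that does not exist:

```scala
import scala.util.{Try, Success, Failure}

// Wrap both the query and the action: depending on the version, the
// "table not found" analysis error may only surface when an action runs.
val result = Try {
  sqlContext.sql("SELECT * FROM no_such_table").collect()
}

result match {
  case Success(rows) => rows.foreach(println)
  case Failure(e)    => println(s"Query failed: ${e.getMessage}")   // e.g. table not found
}
```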

RE: Tuning number of partitions per CPU

2015-02-13 Thread Puneet Kumar Ojha
Use the configuration below if you are using version 1.2: SET spark.shuffle.consolidateFiles=true; SET spark.rdd.compress=true; SET spark.default.parallelism=1000; SET spark.deploy.defaultCores=54; Thanks Puneet. -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Friday,

Columnar-Oriented RDDs

2015-02-13 Thread Night Wolf
Hi all, I'd like to build/use column-oriented RDDs in some of my Spark code. A normal Spark RDD is stored as row-oriented objects, if I understand correctly. I'd like to leverage some of the advantages of a columnar memory format. Shark used to, and Spark SQL does, use a columnar storage format using

Re: running spark project using java -cp command

2015-02-13 Thread Christophe Préaud
You can also export the variable SPARK_PRINT_LAUNCH_COMMAND before launching a spark-submit command to display the java command that will be launched, e.g.: export SPARK_PRINT_LAUNCH_COMMAND=1 /opt/spark/bin/spark-submit --master yarn --deploy-mode cluster --class kelkoo.SparkAppTemplate --jars

Documentation error in MLlib - Clustering?

2015-02-13 Thread Emre Sevinc
Hello, I was trying the streaming kmeans clustering example in the official documentation at: http://spark.apache.org/docs/1.2.0/mllib-clustering.html But I've got a type error when I tried to compile the code: [error] found :

Re: Counters in Spark

2015-02-13 Thread Mark Hamstra
Except that transformations don't have an exactly-once guarantee, so this way of doing counters may produce different answers across various forms of failures and speculative execution. On Fri, Feb 13, 2015 at 8:56 AM, Sean McNamara sean.mcnam...@webtrends.com wrote: .map is just a

Re: Documentation error in MLlib - Clustering?

2015-02-13 Thread Sean Owen
Agree, it's correct in the linear methods doc page too. You can open a PR for this simple 'typo' fix; I imagine it just wasn't updated or fixed with the others. On Fri, Feb 13, 2015 at 4:26 PM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, I was trying the streaming kmeans clustering example

Counters in Spark

2015-02-13 Thread nitinkak001
I am trying to implement counters in Spark and I guess Accumulators are the way to do it. My motive is to update a counter in a map function and access/reset it in the driver code. However, the println statement at the end still yields the value 0 (it should be 9). Am I doing something wrong? def

Re: Counters in Spark

2015-02-13 Thread Sean McNamara
.map is just a transformation, so no work will actually be performed until something takes action against it. Try adding a .count(), like so: inputRDD.map { x => { counter += 1 } }.count() In case it is helpful, here are the docs on what exactly the transformations and actions are:
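
A minimal sketch of the accumulator-plus-action pattern being described, with the caveat from the follow-up message that re-executed or speculative tasks can increment the accumulator more than once:

```scala
val counter  = sc.accumulator(0)             // Spark 1.x accumulator API
val inputRDD = sc.parallelize(1 to 9)

inputRDD.map { x => counter += 1; x }.count() // count() forces the map to run

println(counter.value)                        // 9 here, but not exactly-once under task retries
```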

Re: Cluster launch

2015-02-13 Thread Brad
I'm playing around with Spark on Windows and have two worker nodes running by starting them manually using a script that contains the following: set SPARK_HOME=C:\dev\programs\spark-1.2.0 set SPARK_MASTER_IP=master.brad.com spark-class org.apache.spark.deploy.worker.Worker

Re: An interesting and serious problem I encountered

2015-02-13 Thread Ye Xianjin
Hi, I believe SizeOf.jar may calculate the wrong size for you. Spark has a util called SizeEstimator, located in org.apache.spark.util.SizeEstimator. And someone extracted it out in

How to union RDD and remove duplicated keys

2015-02-13 Thread Wang, Ningjun (LNG-NPV)
I have multiple RDD[(String, String)] that store (docId, docText) pairs, e.g. rdd1: (id1, Long text 1), (id2, Long text 2), (id3, Long text 3) rdd2: (id1, Long text 1 A), (id2, Long text 2 A) rdd3: (id1, Long text 1 B) Then, I want to merge all RDDs. If there are duplicated docIds, the later

Size exceeds Integer.MAX_VALUE exception when broadcasting large variable

2015-02-13 Thread soila
I am trying to broadcast a large 5GB variable using Spark 1.2.0. I get the following exception when the size of the broadcast variable exceeds 2GB. Any ideas on how I can resolve this issue? java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at

Re: How to union RDD and remove duplicated keys

2015-02-13 Thread Boromir Widas
reduceByKey should work, but you need to define the ordering by using some sort of index. On Fri, Feb 13, 2015 at 12:38 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I have multiple RDD[(String, String)] that store (docId, docText) pairs, e.g. rdd1: (“id1”, “Long text

Re: Size exceeds Integer.MAX_VALUE exception when broadcasting large variable

2015-02-13 Thread Sean Owen
I think you've hit the nail on the head. Since the serialization ultimately creates a byte array, and arrays can have at most ~2 billion elements in the JVM, the broadcast can be at most ~2GB. At that scale, you might consider whether you really have to broadcast these values, or want to handle

Re: Shuffle on joining two RDDs

2015-02-13 Thread Imran Rashid
Yeah, I thought the same thing at first too; I suggested something equivalent with preservesPartitioning = true, but that isn't enough. The join is done by union-ing the two transformed RDDs, which is very different from the way it works under the hood in Scala to enable narrow dependencies. It

Re: Size exceeds Integer.MAX_VALUE exception when broadcasting large variable

2015-02-13 Thread Imran Rashid
Unfortunately this is a known issue: https://issues.apache.org/jira/browse/SPARK-1476 As Sean suggested, you need to think of some other way of doing the same thing, even if it's just breaking your one big broadcast var into a few smaller ones. On Fri, Feb 13, 2015 at 12:30 PM, Sean Owen
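
A rough sketch of the "several smaller broadcasts" workaround, under the assumption that the 5GB variable is logically a key-to-value map (`bigMap` below is a placeholder) that can be partitioned by a hash of the key:

```scala
// Placeholder for the real map that is too large to broadcast in one piece.
val bigMap: Map[String, Array[Byte]] = Map.empty

val numChunks = 8
// Split the big map into numChunks smaller maps so each serialized
// broadcast stays under the ~2GB byte-array limit.
val chunks = (0 until numChunks).map { i =>
  bigMap.filter { case (k, _) => math.abs(k.hashCode % numChunks) == i }
}
val broadcasts = chunks.map(sc.broadcast(_))

// On the executors, look a key up in the chunk its hash points to.
def lookup(key: String): Option[Array[Byte]] =
  broadcasts(math.abs(key.hashCode % numChunks)).value.get(key)
```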

Re: SparkSQL doesn't seem to like $'s in column names

2015-02-13 Thread Michael Armbrust
Try using `backticks` to escape non-standard characters. On Fri, Feb 13, 2015 at 11:30 AM, Corey Nolet cjno...@gmail.com wrote: I don't remember Oracle ever enforcing that I couldn't include a $ in a column name, but I also don't think I've ever tried. When using sqlContext.sql(...), I
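
Spelled out against the query from the original question (the table and column names are from that question), the suggestion looks like this:

```scala
// Backticks quote the identifier so the $ is not treated specially.
sqlContext.sql("SELECT * FROM myTable WHERE `locations_$homeAddress` = '123 Elm St'")
```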

Spark standalone and HDFS 2.6

2015-02-13 Thread Grandl Robert
Hi guys, Probably a dummy question. Do you know how to compile Spark 0.9 to easily integrate with HDFS 2.6.0? I was trying sbt/sbt -Pyarn -Phadoop-2.6 assembly or mvn -Dhadoop.version=2.6.0 -DskipTests clean package but none of these approaches succeeded. Thanks, Robert

SparkSQL doesn't seem to like $'s in column names

2015-02-13 Thread Corey Nolet
I don't remember Oracle ever enforcing that I couldn't include a $ in a column name, but I also don't think I've ever tried. When using sqlContext.sql(...), I have a SELECT * from myTable WHERE locations_$homeAddress = '123 Elm St' It's telling me $ is invalid. Is this a bug?

New ColumnType For Decimal Caching

2015-02-13 Thread Manoj Samel
Thanks Michael for the pointer. Sorry for the delayed reply. Taking a quick inventory of the scope of change - is the column type for Decimal caching needed only in the caching layer (4 files in org.apache.spark.sql.columnar - ColumnAccessor.scala, ColumnBuilder.scala, ColumnStats.scala,

Re: Columnar-Oriented RDDs

2015-02-13 Thread Michael Armbrust
Shark's in-memory code was ported to Spark SQL and is used by default when you run .cache on a SchemaRDD or CACHE TABLE. I'd also look at parquet which is more efficient and handles nested data better. On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf nightwolf...@gmail.com wrote: Hi all, I'd like
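
A short sketch of the two options mentioned, using the Spark 1.2-era SchemaRDD API (the JSON file path is a placeholder; method names changed in later versions):

```scala
// In-memory columnar caching of a registered table:
val people = sqlContext.jsonFile("people.json")   // any SchemaRDD source works here
people.registerTempTable("people")
sqlContext.cacheTable("people")                   // stored in the columnar in-memory format

// Columnar on-disk storage with Parquet:
people.saveAsParquetFile("people.parquet")
val fromParquet = sqlContext.parquetFile("people.parquet")
```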

Has Spark 1.2.0 changed EC2 persistent-hdfs?

2015-02-13 Thread Joe Wass
I've updated to Spark 1.2.0 and the EC2 persistent-hdfs behaviour appears to have changed. My launch script is spark-1.2.0-bin-hadoop2.4/ec2/spark-ec2 --instance-type=m3.xlarge -s 5 --ebs-vol-size=1000 launch myproject When I ssh into the master I get: $ df -h Filesystem Size

Re: Spark standalone and HDFS 2.6

2015-02-13 Thread Grandl Robert
I am trying to run BlinkDB(https://github.com/sameeragarwal/blinkdb) which seems to work only with Spark 0.9. However, if I want to access HDFS I need to compile Spark against Hadoop version which is running on my cluster(2.6.0). Hence, the versions problem ... On Friday, February 13,

Re: NaiveBayes classifier causes ShuffleDependency class cast exception

2015-02-13 Thread Tathagata Das
You cannot have two Spark Contexts in the same JVM active at the same time. Just create one SparkContext and then use it for both purpose. TD On Fri, Feb 6, 2015 at 8:49 PM, VISHNU SUBRAMANIAN johnfedrickena...@gmail.com wrote: Can you try creating just a single spark context and then try

Re: Counters in Spark

2015-02-13 Thread Imran Rashid
this is more-or-less the best you can do now, but as has been pointed out, accumulators don't quite fit the bill for counters. There is an open issue to do something better, but no progress on that so far https://issues.apache.org/jira/browse/SPARK-603 On Fri, Feb 13, 2015 at 11:12 AM, Mark

Re: Spark standalone and HDFS 2.6

2015-02-13 Thread Sean Owen
If you just need standalone mode, you don't need -Pyarn. There is no -Phadoop-2.6; you should use -Phadoop-2.4 for 2.4+. Yes, set -Dhadoop.version=2.6.0. That should be it. If that still doesn't work, define "doesn't succeed". On Fri, Feb 13, 2015 at 7:13 PM, Grandl Robert

Re: Interact with streams in a non-blocking way

2015-02-13 Thread Tathagata Das
Here is an example of how you can do it. Let's say myDStream contains the data that you may want to asynchronously query, say using Spark SQL. val sqlContext = new SQLContext(streamingContext.sparkContext) myDStream.foreachRDD { rdd => // rdd is an RDD of case class
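
The (truncated) example above roughly continues along these lines. A sketch only, assuming `myDStream` is a DStream[(String, Int)] and inventing a case class `Record` and a table name `latest_batch`:

```scala
import org.apache.spark.sql.SQLContext

case class Record(word: String, count: Int)        // hypothetical schema for the stream

val sqlContext = new SQLContext(streamingContext.sparkContext)
import sqlContext.createSchemaRDD                   // implicit RDD[Record] -> SchemaRDD in Spark 1.2

myDStream.foreachRDD { rdd =>
  // Re-register the latest batch under a fixed name; other threads can query it asynchronously.
  rdd.map { case (w, c) => Record(w, c) }.registerTempTable("latest_batch")
}

// Elsewhere (e.g. an HTTP handler thread), after awaitTermination() has been called:
val top = sqlContext.sql("SELECT word, count FROM latest_batch ORDER BY count DESC LIMIT 10")
```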

Re: Spark standalone and HDFS 2.6

2015-02-13 Thread Sean Owen
OK, from scanning the pom.xml, I think you would try: -Pyarn -Dhadoop.version=2.6.0 If it doesn't package or pass tests, then I'd assume it's not supported :( On Fri, Feb 13, 2015 at 7:33 PM, Grandl Robert rgra...@yahoo.com wrote: I am trying to run

Re: Has Spark 1.2.0 changed EC2 persistent-hdfs?

2015-02-13 Thread Joe Wass
Looks like this is caused by issue SPARK-5008: https://issues.apache.org/jira/browse/SPARK-5008 On 13 February 2015 at 19:04, Joe Wass jw...@crossref.org wrote: I've updated to Spark 1.2.0 and the EC2 persistent-hdfs behaviour appears to have changed. My launch script is

RE: How to union RDD and remove duplicated keys

2015-02-13 Thread Wang, Ningjun (LNG-NPV)
Do you mean first union all RDDs together and then do a reduceByKey()? Suppose my unioned RDD is rdd : (“id1”, “text 1”), (“id1”, “text 2”), (“id1”, “text 3”) How can I use reduceByKey to return (“id1”, “text 3”)? I mean to take the last entry for the same key. Code snippet is

Re: Spark standalone and HDFS 2.6

2015-02-13 Thread Grandl Robert
Thanks Sean for your prompt response. I was trying to compile as follows: mvn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package but I got a bunch of errors (see below). Hadoop 2.6.0 compiled correctly, and all hadoop jars are in the .m2 repository. Do you have any idea what might

Re: Spark standalone and HDFS 2.6

2015-02-13 Thread Sean Owen
Oh right, you said Spark 0.9. Those profiles didn't exist back then. I don't even know if Hadoop 2.6 will work with 0.9 as-is. The profiles were introduced later to fix up some compatibility. Why not use 1.2.1? On Fri, Feb 13, 2015 at 7:26 PM, Grandl Robert rgra...@yahoo.com wrote: Thanks Sean

Re: Size exceeds Integer.MAX_VALUE exception when broadcasting large variable

2015-02-13 Thread Soila Pertet Kavulya
Thanks Sean and Imran, I'll try splitting the broadcast variable into smaller ones. I had tried a regular join but it was failing due to high garbage collection overhead during the shuffle. One of the RDDs is very large and has a skewed distribution where a handful of keys account for 90% of the

Learning GraphX Questions

2015-02-13 Thread Matthew Bucci
Hello, I was looking at GraphX as I believe it can be useful in my research on temporal data and I had a number of questions about the system: 1) How do you actually run programs in GraphX? At the moment I've been doing everything live through the shell, but I'd obviously like to be able to

Re: Why are there different parts in my CSV?

2015-02-13 Thread Su She
Thanks Akhil for the suggestion, it is now only giving me one part file. Is there any way I can just create a file rather than a directory? It doesn't seem like there is a saveAsTextFile option for JavaPairReceiverDStream. Also, for the copy/merge API, how would I add that to my Spark job?
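
For the copy/merge part of the question, a hedged sketch using the Hadoop FileUtil API (available in Hadoop 2.x; both paths are placeholders), run from the driver after saveAsTextFile has finished:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs   = FileSystem.get(conf)

// Merge the part-* files produced by saveAsTextFile into a single file.
FileUtil.copyMerge(
  fs, new Path("hdfs:///output/mydata"),        // directory written by saveAsTextFile
  fs, new Path("hdfs:///output/mydata.txt"),    // single merged output file
  false,                                        // don't delete the source directory
  conf, null)
```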

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-13 Thread Davies Liu
This is fixed in 1.2.1, could you upgrade to 1.2.1? On Thu, Feb 12, 2015 at 4:55 AM, Rok Roskar rokros...@gmail.com wrote: Hi again, I narrowed down the issue a bit more -- it seems to have to do with the Kryo serializer. When I use it, then this results in a Null Pointer: rdd =

Re: Columnar-Oriented RDDs

2015-02-13 Thread Koert Kuipers
I wrote a proof of concept to automatically store any RDD of tuples or case classes in columnar format using arrays (and strongly typed, so you get the benefit of primitive arrays). See: https://github.com/tresata/spark-columnar On Fri, Feb 13, 2015 at 3:06 PM, Michael Armbrust

SQLContext.applySchema strictness

2015-02-13 Thread Justin Pihony
Per the documentation: It is important to make sure that the structure of every Row of the provided RDD matches the provided schema. Otherwise, there will be runtime exception. However, it appears that this is not being enforced. import org.apache.spark.sql._ val sqlContext = new

Re: How to union RDD and remove duplicated keys

2015-02-13 Thread Boromir Widas
I have not run the following, but it will be along these lines - rdd.zipWithIndex().map(x => (x._1._1, (x._1._2, x._2))).reduceByKey((a, b) => { if (a._2 > b._2) a else b }).map(x => (x._1, x._2._1)) On Fri, Feb 13, 2015 at 3:27 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Do you mean

Re: SQLContext.applySchema strictness

2015-02-13 Thread Yin Huai
Hi Justin, It is expected. We do not check if the provided schema matches rows since all rows need to be scanned to give a correct answer. Thanks, Yin On Fri, Feb 13, 2015 at 1:33 PM, Justin Pihony justin.pih...@gmail.com wrote: Per the documentation: It is important to make sure that
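
A small sketch of what the thread describes, using the Spark 1.2 applySchema API (assuming `sc` and `sqlContext` already exist; the schema and data are made up): applySchema accepts the rows without validation, and any failure can only show up at runtime when a query actually evaluates the mismatched column.

```scala
import org.apache.spark.sql._

val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age",  IntegerType, nullable = false)))

// The second row deliberately violates the schema (age is a String).
val rowRDD = sc.parallelize(Seq(Row("alice", 30), Row("bob", "not-a-number")))

val schemaRdd = sqlContext.applySchema(rowRDD, schema)   // accepted: rows are not checked here
schemaRdd.registerTempTable("people")
// Only a query that evaluates the bad column can fail at runtime, e.g.:
// sqlContext.sql("SELECT age + 1 FROM people").collect()
```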

Re: SQLContext.applySchema strictness

2015-02-13 Thread Justin Pihony
OK, but what about on an action, like collect()? Shouldn't it be able to determine the correctness at that time? On Fri, Feb 13, 2015 at 4:49 PM, Yin Huai yh...@databricks.com wrote: Hi Justin, It is expected. We do not check if the provided schema matches rows since all rows need to be

Re: SparkSQL doesn't seem to like $'s in column names

2015-02-13 Thread Corey Nolet
This doesn't seem to have helped. On Fri, Feb 13, 2015 at 2:51 PM, Michael Armbrust mich...@databricks.com wrote: Try using `backticks` to escape non-standard characters. On Fri, Feb 13, 2015 at 11:30 AM, Corey Nolet cjno...@gmail.com wrote: I don't remember Oracle ever enforcing that I

Re: Boolean values as predicates in SQL string

2015-02-13 Thread Corey Nolet
Never mind - I think I may have had a schema-related issue (sometimes booleans were represented as strings and sometimes as raw booleans, but when I populated the schema one or the other was chosen). On Fri, Feb 13, 2015 at 8:03 PM, Corey Nolet cjno...@gmail.com wrote: Here are the results of a

Re: SparkException: Task not serializable - Jackson Json

2015-02-13 Thread jamckelvey
I'm having the same problem with the same sample code. Any progress on this?

Boolean values as predicates in SQL string

2015-02-13 Thread Corey Nolet
Here are the results of a few different SQL strings (let's assume the schemas are valid for the data types used): SELECT * from myTable where key1 = true - no filters are pushed to my PrunedFilteredScan SELECT * from myTable where key1 = true and key2 = 5 - 1 filter (key2) is pushed to my

Re: Learning GraphX Questions

2015-02-13 Thread Ankur Dave
At 2015-02-13 12:19:46 -0800, Matthew Bucci mrbucci...@gmail.com wrote: 1) How do you actually run programs in GraphX? At the moment I've been doing everything live through the shell, but I'd obviously like to be able to work on it by writing and running scripts. You can create your own
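
For the "how do you run GraphX outside the shell" part, a minimal standalone program following the standard Spark 1.x application skeleton (the object name and the tiny graph are made up; package it and run with spark-submit):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GraphXApp"))

    // A tiny made-up graph: vertices carry names, edges carry weights.
    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0)))
    val graph    = Graph(vertices, edges)

    println("in-degrees: " + graph.inDegrees.collect().mkString(", "))
    sc.stop()
  }
}
```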

Re: An interesting and serious problem I encountered

2015-02-13 Thread Landmark
Thanks for Ye Xianjin's suggestions. The SizeOf.jar may indeed have some problems. I did a simple test as follows. The code is: val n = 1; //5; //10; //100; //1000; val arr1 = new Array[(Int, Array[Int])](n); for(i <- 0 until arr1.length){ arr1(i) = (i, new