Re: Serialization issue when using HBase with Spark

2014-12-14 Thread Yanbo
In #1, class HTable is not serializable. You also need to check your self-defined function getUserActions and make sure it is a member function of a class that implements the Serializable interface. Sent from my iPad. On Dec 12, 2014, at 4:35 PM, yangliuyu yangli...@163.com wrote: The scenario is using HTable
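A minimal sketch of the usual workaround, with hypothetical table and key names: since HTable is not serializable, construct it inside mapPartitions so it is created on each executor rather than captured in the closure.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Get, HTable}
    import org.apache.spark.rdd.RDD

    def fetchUserActions(rowKeys: RDD[String]): RDD[(String, Boolean)] =
      rowKeys.mapPartitions { keys =>
        // build the non-serializable HTable per partition, on the executor
        val conf = HBaseConfiguration.create()
        val table = new HTable(conf, "user_actions") // hypothetical table name
        val results = keys.map { k =>
          val row = table.get(new Get(k.getBytes("UTF-8")))
          (k, !row.isEmpty)
        }.toList // materialize before closing the table
        table.close()
        results.iterator
      }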

Re: JSON Input files

2014-12-14 Thread Madabhattula Rajesh Kumar
Hi Helena and all, I have the below example JSON file format. My use case is to read the NAME variable. When I execute it, I get the following exception: *Exception in thread main org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'NAME, tree:Project ['NAME] Subquery device*

Re: Spark-SQL JDBC driver

2014-12-14 Thread Michael Armbrust
I'll add that there is an experimental method that allows you to start the JDBC server with an existing HiveContext (which might have registered temporary tables).
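The experimental hook Michael mentions is presumably HiveThriftServer2.startWithContext; a sketch, assuming a Spark 1.2 build, an existing SparkContext sc, and a hypothetical data file:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    // temporary tables registered on this context become visible to JDBC clients
    hiveContext.jsonFile("events.json").registerTempTable("events") // hypothetical
    HiveThriftServer2.startWithContext(hiveContext)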

Re: SchemaRDD partition on specific column values?

2014-12-14 Thread Michael Armbrust
I'm happy to discuss what it would take to make sure we can propagate this information correctly. Please open a JIRA (and mention me in it). Regarding including it in 1.2.1, it depends on how invasive the change ends up being, but it is certainly possible. On Thu, Dec 11, 2014 at 3:55 AM, nitin

Re: JSON Input files

2014-12-14 Thread Yanbo
Pay attention to your JSON file; try to change it like the following. Each record is represented as a JSON string on a single line. {"NAME" : "Device 1", "GROUP" : "1", "SITE" : "qqq", "DIRECTION" : "East", } {"NAME" : "Device 2", "GROUP" : "2", "SITE" : "sss", "DIRECTION" : "North", } On
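A sketch of reading such a one-object-per-line file and selecting NAME, assuming a hypothetical path and the 1.x SchemaRDD API:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val devices = sqlContext.jsonFile("devices.json") // one JSON object per line
    devices.registerTempTable("device")
    sqlContext.sql("SELECT NAME FROM device").collect().foreach(println)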

Re: Read data from SparkStreaming from Java socket.

2014-12-14 Thread Guillermo Ortiz
Why doesn't it work?? I guess it's the same with \n. 2014-12-13 12:56 GMT+01:00 Guillermo Ortiz konstt2...@gmail.com: I got it, thanks. A silly question: why, if I do out.write("hello" + System.currentTimeMillis() + "\n"), doesn't it detect anything, while if I do out.println("hello" +

Re: Read data from SparkStreaming from Java socket.

2014-12-14 Thread Gerard Maas
Are you using a buffered PrintWriter? That probably has different flushing behaviour. Try doing out.flush() after out.write(...) and you will get the same result. This is Spark-unrelated btw. -kr, Gerard.
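A sketch of the difference, assuming a plain java.io PrintWriter over a server socket on a hypothetical port:

    import java.io.PrintWriter
    import java.net.ServerSocket

    val server = new ServerSocket(9999) // hypothetical port
    val out = new PrintWriter(server.accept().getOutputStream, true) // autoFlush
    out.write("hello " + System.currentTimeMillis() + "\n")
    out.flush() // write() alone leaves the data in the buffer; flush explicitly
    out.println("hello " + System.currentTimeMillis()) // println triggers autoFlush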

Re: MLLIB model export: PMML vs MLLIB serialization

2014-12-14 Thread selvinsource
Hi Sourabh, have a look at https://issues.apache.org/jira/browse/SPARK-1406, I am looking into exporting models in PMML using JPMML. Regards, Vincenzo -- View this message in context:

Re: Read data from SparkStreaming from Java socket.

2014-12-14 Thread Guillermo Ortiz
Thanks. 2014-12-14 12:20 GMT+01:00 Gerard Maas gerard.m...@gmail.com: Are you using a buffered PrintWriter? That probably has different flushing behaviour. Try doing out.flush() after out.write(...) and you will get the same result. This is Spark-unrelated btw. -kr, Gerard.

Re: Having problem with Spark streaming with Kinesis

2014-12-14 Thread Aniket Bhatnagar
The reason is the following code: val numStreams = numShards val kinesisStreams = (0 until numStreams).map { i => KinesisUtils.createStream(ssc, streamName, endpointUrl, kinesisCheckpointInterval, InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2) } In the above
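The explanation is cut off above; for context, the Kinesis example this code comes from unions the per-shard streams into a single DStream, and each receiver also occupies an executor core, so the application needs more cores than receivers. A sketch, assuming the variables above:

    // combine the per-shard receiver streams into a single DStream
    val unionStreams = ssc.union(kinesisStreams)
    unionStreams.map(bytes => new String(bytes, "UTF-8")).print()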

Re: JSON Input files

2014-12-14 Thread Madabhattula Rajesh Kumar
Thank you Yanbo. Regards, Rajesh On Sun, Dec 14, 2014 at 3:15 PM, Yanbo yanboha...@gmail.com wrote: Pay attention to your JSON file; try to change it like the following. Each record is represented as a JSON string on a single line. {"NAME" : "Device 1", "GROUP" : "1", "SITE" : "qqq", "DIRECTION" : "East",

pyspark is crashing in this case. why?

2014-12-14 Thread genesis fatum
Hi, My environment is: standalone Spark 1.1.1 on Windows 8.1 Pro. The following case works fine:
>>> a = [1,2,3,4,5,6,7,8,9]
>>> b = []
>>> for x in range(10):
...   b.append(a)
...
>>> rdd1 = sc.parallelize(b)
>>> rdd1.first()
[1, 2, 3, 4, 5, 6, 7, 8, 9]
The following case does not work. The only

MLlib vs Madlib

2014-12-14 Thread Venkat, Ankam
Can somebody shed light on MLlib vs MADlib? Which is better for machine learning, and are there any specific use-case scenarios where MLlib or MADlib will shine? Regards, Venkat Ankam

Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
I have a large number of files within HDFS that I would like to do a group by statement on, ala val table = sc.textFile("hdfs://") val tabs = table.map(_.split("\t")) I'm trying to do something similar to tabs.map(c => (c._(167), c._(110), c._(200)) where I create a new RDD that only has but that isn't

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Gerard Maas
Hi, I don't get what the problem is. That map to selected columns looks like the way to go given the context. What's not working? Kr, Gerard On Dec 14, 2014 5:17 PM, Denny Lee denny.g@gmail.com wrote: I have a large number of files within HDFS that I would like to do a group by statement ala

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
Getting a bunch of syntax errors. Let me get back with the full statement and error later today. Thanks for verifying my thinking wasn't out in left field. On Sun, Dec 14, 2014 at 08:56 Gerard Maas gerard.m...@gmail.com wrote: Hi, I don't get what the problem is. That map to selected columns

Re: MLlib vs Madlib

2014-12-14 Thread Brian Dolan
MADLib (http://madlib.net/) was designed to bring large-scale ML techniques to a relational database, primarily postgresql. MLlib assumes the data exists in some Spark-compatible data format. I would suggest you pick the library that matches your data platform first. DISCLAIMER: I am the

DStream demultiplexer based on a key

2014-12-14 Thread Jean-Pascal Billaud
Hey, I am doing an experiment with Spark Streaming consisting of moving data from Kafka to S3 locations while partitioning by date. I have already looked into LinkedIn Camus and Pinterest Secor, and while both are workable solutions, it just feels that Spark Streaming should be able to be on par

Re: pyspark is crashing in this case. why?

2014-12-14 Thread Sameer Farooqui
How much executor-memory are you setting for the JVM? What about the Driver JVM memory? Also check the Windows Event Log for Out of memory errors for one of the 2 above JVMs. On Dec 14, 2014 6:04 AM, genesis fatum genesis.fa...@gmail.com wrote: Hi, My environment is: standalone spark 1.1.1 on

Re: DStream demultiplexer based on a key

2014-12-14 Thread Gerard Maas
Hi Jean-Pascal, At Virdata we do a similar thing to 'bucketize' our data to different keyspaces in Cassandra. The basic construction would be to filter the DStream (or the underlying RDD) for each key and then apply the usual storage operations on that new data set. Given that, in your case, you
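A minimal sketch of that construction, with a hypothetical keyed DStream and output prefix:

    import org.apache.spark.SparkContext._

    // pairs: DStream[(String, String)] keyed by, e.g., date
    pairs.foreachRDD { rdd =>
      rdd.cache() // the RDD is traversed once per key below
      val keys = rdd.keys.distinct().collect()
      keys.foreach { k =>
        rdd.filter { case (key, _) => key == k }
          .values
          .saveAsTextFile("s3n://mybucket/" + k) // hypothetical S3 prefix
      }
      rdd.unpersist()
    }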

Re: DStream demultiplexer based on a key

2014-12-14 Thread Jean-Pascal Billaud
Ah! That sounds very much like what I need. A very basic question (most likely): why is rdd.cache() critical? Isn't it already true that in Spark Streaming DStreams are cached in memory anyway? Also, any experience with minutes-long batch intervals? Thanks for the quick answer! On Sun, Dec 14,

HTTP 500 Error for SparkUI in YARN Cluster mode

2014-12-14 Thread Benyi Wang
I get this error when I click Track URL: ApplicationMaster when I run a Spark job in YARN cluster mode. I found this JIRA, https://issues.apache.org/jira/browse/YARN-800, but I could not get the problem fixed. I'm running CDH 5.1.0 with both HDFS and RM HA enabled. Does anybody have a similar

Re: DStream demultiplexer based on a key

2014-12-14 Thread Gerard Maas
I haven't done anything other than performance tuning on Spark Streaming for the past weeks. rdd.cache makes a huge difference: a must in this case, where you want to iterate over the same RDD several times. Intuitively, I also thought that all data was in memory already so that wouldn't make a

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Michael Armbrust
BTW, I cannot use SparkSQL / case right now because my table has 200 columns (and I'm on Scala 2.10.3) You can still apply the schema programmatically: http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
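A sketch of that programmatic route, trimmed to three of the 200 columns, with hypothetical column names and the 1.2-era applySchema API:

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)
    val schema = StructType(Seq(
      StructField("col167", StringType, nullable = true),
      StructField("col110", StringType, nullable = true),
      StructField("col200", StringType, nullable = true)))
    // tabs: RDD[Array[String]] from table.map(_.split("\t"))
    val rowRDD = tabs.map(c => Row(c(167), c(110), c(200)))
    val schemaRDD = sqlContext.applySchema(rowRDD, schema)
    schemaRDD.registerTempTable("wide_table") // hypothetical table name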

Re: Trouble with cache() and parquet

2014-12-14 Thread Michael Armbrust
For many operations, Spark SQL will just pass the data through without looking at it. Caching, in contrast, has to process the data so that we can build up compressed column buffers. So the schema is mismatched in both cases, but only the caching case shows it. Based on the exception, it looks

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
Yes - that works great! Sorry for implying I couldn't. Was just more flummoxed that I couldn't make the Scala call work on its own. Will continue to debug ;-) On Sun, Dec 14, 2014 at 11:39 Michael Armbrust mich...@databricks.com wrote: BTW, I cannot use SparkSQL / case right now because my table

spark kafka batch integration

2014-12-14 Thread Koert Kuipers
hello all, we at tresata wrote a library to provide batch integration between spark and kafka (distributed write of rdd to kafka, distributed read of rdd from kafka). our main use cases are (in lambda architecture jargon): * periodic appends to the immutable master dataset on hdfs from kafka

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Yana Kadiyska
Denny, I am not sure what exception you're observing, but I've had luck with 2 things: val table = sc.textFile("hdfs://") You can try calling table.first here and you'll see the first line of the file. You can also do val debug = table.first.split("\t") which would give you an array and you can

RE: MLlib vs Madlib

2014-12-14 Thread Venkat, Ankam
Thanks for the info, Brian. I am trying to compare the performance difference between Pivotal HAWQ/Greenplum with MADlib vs HDFS with MLlib. Do you think Spark MLlib will perform better because of its in-memory, caching and iterative processing capabilities? I need to perform large-scale text

Re: MLlib vs Madlib

2014-12-14 Thread Brian Dolan
I don't have any solid performance numbers, no. Let's start with some questions: * Do you have to do any feature extraction before you start the routine, e.g. NLP, NER or tokenization? Have you already vectorized? * Which routine(s) do you wish to use? Things like k-means do very well in a

Re: Running spark-submit from a remote machine using a YARN application

2014-12-14 Thread Tobias Pfeiffer
Hi, On Fri, Dec 12, 2014 at 7:01 AM, ryaminal tacmot...@gmail.com wrote: Now our solution is to make a very simple YARN application which executes as its command spark-submit --master yarn-cluster s3n://application/jar.jar This seemed so simple and elegant, but it has some weird

Re: Adding a column to a SchemaRDD

2014-12-14 Thread Tobias Pfeiffer
Nathan, On Fri, Dec 12, 2014 at 3:11 PM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: I can see how to do it if I can express the added values in SQL - just run SELECT *, valueCalculation AS newColumnName FROM table I've been searching all over for how to do this if my added value is a

Run Spark job on Playframework + Spark Master/Worker in one Mac

2014-12-14 Thread Tomoya Igarashi
Hi all, I am trying to run a Spark job on Playframework + Spark Master/Worker on one Mac. When the job ran, I encountered java.lang.ClassNotFoundException. Could you show me how to solve it? Here is my code on Github: https://github.com/TomoyaIgarashi/spark_cluster_sample * Environments: Mac 10.9.5

Re: what is the best way to implement mini batches?

2014-12-14 Thread Earthson
I think it could be done like this: 1. use mapPartitions to randomly drop some partitions 2. drop some elements randomly (for the selected partitions) 3. calculate the gradient step for the selected elements I don't think a fixed step is needed, but a fixed step could be done: 1. zipWithIndex 2. create a ShuffleRDD
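A sketch of a sampling-based variant of the same idea; the least-squares gradient is chosen purely for illustration:

    import org.apache.spark.rdd.RDD

    // data: RDD of (label, features); draws a random mini-batch per step
    def miniBatchSGD(data: RDD[(Double, Array[Double])], numFeatures: Int,
                     fraction: Double = 0.1, stepSize: Double = 0.01,
                     iterations: Int = 100): Array[Double] = {
      var w = Array.fill(numFeatures)(0.0)
      for (i <- 1 to iterations) {
        val batch = data.sample(withReplacement = false, fraction, seed = i)
        val n = batch.count()
        if (n > 0) {
          val wLocal = w // capture current weights by value for the closure
          // least-squares gradient summed over the mini-batch
          val grad = batch.map { case (y, x) =>
            val err = x.zip(wLocal).map { case (xi, wi) => xi * wi }.sum - y
            x.map(_ * err)
          }.reduce((a, b) => a.zip(b).map { case (p, q) => p + q })
          w = w.zip(grad).map { case (wi, gi) => wi - stepSize * gi / n }
        }
      }
      w
    }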

Spark Streaming Python APIs?

2014-12-14 Thread Xiaoyong Zhu
Hi Spark experts, Are there any Python APIs for Spark Streaming? I didn't find the Python APIs in the Spark Streaming programming guide: http://spark.apache.org/docs/latest/streaming-programming-guide.html Xiaoyong

RE: Spark Streaming Python APIs?

2014-12-14 Thread Shao, Saisai
AFAIK, this will be a new feature in version 1.2; you can check out the master branch or the 1.2 branch to give it a try. Thanks, Jerry From: Xiaoyong Zhu [mailto:xiaoy...@microsoft.com] Sent: Monday, December 15, 2014 10:53 AM To: user@spark.apache.org Subject: Spark Streaming Python APIs? Hi spark

RE: Spark Streaming Python APIs?

2014-12-14 Thread Xiaoyong Zhu
Cool, thanks! Xiaoyong From: Shao, Saisai [mailto:saisai.s...@intel.com] Sent: Monday, December 15, 2014 10:57 AM To: Xiaoyong Zhu Cc: user@spark.apache.org Subject: RE: Spark Streaming Python APIs? AFAIK, this will be a new feature in version 1.2; you can check out the master branch or 1.2

RE: Spark Streaming Python APIs?

2014-12-14 Thread Xiaoyong Zhu
Btw, I have seen the Python-related docs in the 1.2 docs here: http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/streaming-programming-guide.html Xiaoyong From: Xiaoyong Zhu [mailto:xiaoy...@microsoft.com] Sent: Monday, December 15, 2014 10:58 AM To: Shao, Saisai Cc: user@spark.apache.org

Re: ALS failure with size Integer.MAX_VALUE

2014-12-14 Thread Bharath Ravi Kumar
Hi Xiangrui, The block size limit was encountered even with a reduced number of item blocks, as you had expected. I'm wondering if I could try the new implementation as a standalone library against a 1.1 deployment. Does it have dependencies on any core APIs in the current master? Thanks, Bharath

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
Oh, just figured it out: tabs.map(c => Array(c(167), c(110), c(200))) Thanks for all of the advice, eh?! On Sun Dec 14 2014 at 1:14:00 PM Yana Kadiyska yana.kadiy...@gmail.com wrote: Denny, I am not sure what exception you're observing, but I've had luck with 2 things: val table =
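For completeness, the corrected pipeline from this thread in one place, with a hypothetical input path:

    import org.apache.spark.SparkContext._

    val table = sc.textFile("hdfs://...") // hypothetical path
    val tabs = table.map(_.split("\t"))
    // keep only columns 167, 110 and 200, keyed on the first for the group by
    val projected = tabs.map(c => (c(167), (c(110), c(200))))
    val grouped = projected.groupByKey()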

Re: KafkaUtils explicit acks

2014-12-14 Thread Mukesh Jha
Thanks TD and Francois for the explanation and documentation. I'm curious whether we have any performance benchmarks with and without the WAL for spark-streaming-kafka. Also, in spark-streaming-kafka (as Kafka provides a way to acknowledge logs), on top of the WAL can we modify KafkaUtils to acknowledge the offsets only

Q about Spark MLlib- Decision tree - scala.MatchError: 2.0 (of class java.lang.Double)

2014-12-14 Thread jake Lim
I am working on some Spark MLlib tests (Decision Tree) and I used IRIS data from the CRAN R package. The original IRIS data is not in a good format for Spark MLlib, so I changed the data format (the data format and the features' locations). When I ran the sample Spark MLlib code for DT, I met the error below

RE: KafkaUtils explicit acks

2014-12-14 Thread Shao, Saisai
Hi, It is not trivial work to acknowledge the offsets when an RDD is fully processed. From my understanding, only modifying KafkaUtils is not enough to meet your requirement; you need to add metadata management for each block/RDD and track them on both the executor and driver sides, and