SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread dmoralesdf
Hi there, We have released our real-time aggregation engine based on Spark Streaming. SPARKTA is fully open source (Apache2). You can check out the slides shown at Strata last week: http://www.slideshare.net/Stratio/strata-sparkta Source code: https://github.com/Stratio/sparkta And

Re: how to set random seed

2015-05-14 Thread Charles Hayden
Thanks for the reply. I have not tried it out (I will today and report on my results) but I think what I need to do is to call mapPartitions and pass it a function that sets the seed. I was planning to pass the seed value in the closure. Something like:

    my_seed = 42
    def f(iterator):
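
For reference, a minimal sketch of the same per-partition seeding idea in Scala (the 10% sampling predicate and the seed-plus-partition-index offset are assumptions, not from the thread):

    import scala.util.Random

    // sc: SparkContext. Seed once per partition; offsetting the seed by the
    // partition index keeps every partition from producing the same sequence.
    val seed = 42L
    val data = sc.parallelize(1 to 1000, 4)
    val sampled = data.mapPartitionsWithIndex { (idx, iter) =>
      val rng = new Random(seed + idx)
      iter.filter(_ => rng.nextDouble() < 0.1)
    }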

Re: Using sc.HadoopConfiguration in Python

2015-05-14 Thread Ram Sriharsha
Here is an example of how I would pass the S3 parameters to the hadoop configuration in pyspark. You can do something similar for other parameters you want to pass to the hadoop configuration:

    hadoopConf = sc._jsc.hadoopConfiguration()
    hadoopConf.set("fs.s3.impl",
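
For comparison, a sketch of the same thing on the Scala side, where the Hadoop configuration is exposed directly on the SparkContext (the fs.s3n key names and credentials below are placeholders and depend on your Hadoop version):

    // sc: SparkContext
    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")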

Re: reduceByKey

2015-05-14 Thread Gaspar Muñoz
What have you tried so far? Maybe the easiest way is using a collection and reducing it by adding the values.

    JavaPairRDD<String, String> pairRDD = sc.parallelizePairs(data);
    JavaPairRDD<String, List<Integer>> result = pairRDD.mapToPair(new Functions.createList())
        .mapToPair(new

Re: how to delete data from table in sparksql

2015-05-14 Thread Denny Lee
Delete from table is available as part of Hive 0.14 (reference: Apache Hive Language Manual DML - Delete https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Delete) while Spark 1.3 defaults to Hive 0.13. Perhaps rebuild Spark with Hive 0.14 or generate a new

Re: [Unit Test Failure] Test org.apache.spark.streaming.JavaAPISuite.testCount failed

2015-05-14 Thread Wangfei (X)
Yes, it fails repeatedly on my local Jenkins. Sent from my iPhone. On May 14, 2015, at 18:30, Tathagata Das t...@databricks.com wrote: Do you get this failure repeatedly? On Thu, May 14, 2015 at 12:55 AM, kf wangf...@huawei.com wrote: Hi all, I got the following

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread lisendong
it still doesn't work… the StreamingContext can detect the new file, but it shows: ERROR dstream.FileInputDStream: File hdfs://nameservice1/sandbox/hdfs/list_join_action/2015_05_14_20_stream_1431605640.lz4 has no data in it. Spark Streaming can only ingest files that have been moved to the

reduceByKey

2015-05-14 Thread Yasemin Kaya
Hi, I have a JavaPairRDD<String, String> and I want to apply the reduceByKey method. My pairRDD:

    2553: 0,0,0,1,0,0,0,0
    46551: 0,1,0,0,0,0,0,0
    266: 0,1,0,0,0,0,0,0
    2553: 0,0,0,0,0,1,0,0
    225546: 0,0,0,0,0,1,0,0
    225546: 0,0,0,0,0,1,0,0

I want to get:

    2553: 0,0,0,1,0,1,0,0
    46551:
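
For reference, one way to express this element-wise sum, sketched in Scala (the parsing and formatting choices are assumptions):

    val pairs = sc.parallelize(Seq(
      ("2553",   "0,0,0,1,0,0,0,0"),
      ("2553",   "0,0,0,0,0,1,0,0"),
      ("225546", "0,0,0,0,0,1,0,0"),
      ("225546", "0,0,0,0,0,1,0,0")))

    val summed = pairs
      .mapValues(_.split(",").map(_.toInt))                         // "0,0,1" -> Array(0, 0, 1)
      .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y }) // element-wise add per key
      .mapValues(_.mkString(","))                                   // back to "0,0,2"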

Re: --jars works in yarn-client but not yarn-cluster mode, why?

2015-05-14 Thread Fengyun RAO
thanks, Wilfred. In our program, the htrace-core-3.1.0-incubating.jar dependency is only required by the executors, not the driver, and in both yarn-client and yarn-cluster mode the executors run in the cluster. In yarn-cluster mode the jar clearly IS in spark.yarn.secondary.jars, but

Re: spark sql hive-shims

2015-05-14 Thread Lior Chaga
I see that the pre-built distributions include hive-shims-0.23 shaded into the spark-assembly jar (unlike when I make the distribution myself). Does anyone know what I should do to include the shims in my distribution? On Thu, May 14, 2015 at 9:52 AM, Lior Chaga lio...@taboola.com wrote:

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Matei Zaharia
...This is madness! On May 14, 2015, at 9:31 AM, dmoralesdf dmora...@stratio.com wrote: Hi there, We have released our real-time aggregation engine based on Spark Streaming. SPARKTA is fully open source (Apache2). You can check out the slides shown at Strata last week:

RE: swap tuple

2015-05-14 Thread Stephen Carman
Yea, I wouldn't try to modify the current one, since RDDs are supposed to be immutable; just create a new one... val newRdd = oldRdd.map(r => (r._2, r._1)) or something of that nature... Steve From: Evo Eftimov [evo.efti...@isecc.com] Sent: Thursday, May 14, 2015

Re: how to delete data from table in sparksql

2015-05-14 Thread Michael Armbrust
The list of unsupported Hive features should mention that it implicitly includes features added after Hive 13. You cannot yet compile against Hive 14, though we are investigating this for 1.5. On Thu, May 14, 2015 at 6:40 AM, Denny Lee denny.g@gmail.com wrote: Delete from table is available

RE: swap tuple

2015-05-14 Thread Evo Eftimov
Where is the “Tuple” supposed to be in <String, String> – you could refer to a “Tuple” if it were e.g. <String, Tuple2<String, String>>. From: holden.ka...@gmail.com [mailto:holden.ka...@gmail.com] On Behalf Of Holden Karau Sent: Thursday, May 14, 2015 5:56 PM To: Yasemin Kaya Cc:

Re: store hive metastore on persistent store

2015-05-14 Thread Michael Armbrust
You can configure Spark SQL's Hive interaction by placing a hive-site.xml file in the conf/ directory. On Thu, May 14, 2015 at 10:24 AM, jamborta jambo...@gmail.com wrote: Hi all, is it possible to set hive.metastore.warehouse.dir, that is internally create by spark, to be stored externally
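
A minimal sketch of such a hive-site.xml, with an s3n warehouse path as a placeholder (whether the metastore resolves an external path like this depends on your Hadoop/S3 setup):

    <configuration>
      <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>s3n://my-bucket/hive/warehouse</value>
      </property>
    </configuration>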

Re: swap tuple

2015-05-14 Thread Yasemin Kaya
I solved my problem this way.

    JavaPairRDD<String, String> swappedPair = pair.mapToPair(
        new PairFunction<Tuple2<String, String>, String, String>() {
          @Override
          public Tuple2<String, String> call(Tuple2<String, String> item) throws Exception {
            return item.swap();
          }
        });

2015-05-14 20:42 GMT+03:00

Re: swap tuple

2015-05-14 Thread Holden Karau
Can you paste your code? Transformations return a new RDD rather than modifying an existing one, so if you were to swap the values of the tuple using a map you would get back a new RDD, and then you would want to print this new RDD instead of the original one. On Thursday, May 14, 2015,

store hive metastore on persistent store

2015-05-14 Thread jamborta
Hi all, is it possible to set hive.metastore.warehouse.dir, which is internally created by Spark, to be stored externally (e.g. s3 on aws or wasb on azure)? thanks,

Re: [Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-14 Thread Michael Armbrust
End of the month is the target: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage On Thu, May 14, 2015 at 3:45 AM, Ishwardeep Singh ishwardeep.si...@impetus.co.in wrote: Hi Michael Ayan, Thank you for your response to my problem. Michael do we have a tentative release

Re: reduceByKey

2015-05-14 Thread ayan guha
Here is a python code, I am sure you'd get the drift. Basically you need to implement 2 functions, seq and comb, in order to do the partial and final operations.

    def addtup(t1, t2):
        j = ()
        for k, v in enumerate(t1):
            j = j + (t1[k] + t2[k],)
        return j

    def seq(tIntrm, tNext):
        return
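
The same partial/final split, sketched in Scala with aggregateByKey (the vector width of 8 matches the examples earlier in the thread and is an assumption):

    val vecs = sc.parallelize(Seq(
      ("2553", Array(0, 0, 0, 1, 0, 0, 0, 0)),
      ("2553", Array(0, 0, 0, 0, 0, 1, 0, 0))))

    val result = vecs.aggregateByKey(Array.fill(8)(0))(
      (acc, v) => acc.zip(v).map { case (a, b) => a + b },  // seq: fold one value into the partial sum
      (a, b)   => a.zip(b).map { case (x, y) => x + y })    // comb: merge partial sums across partitions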

Re: Using sc.HadoopConfiguration in Python

2015-05-14 Thread ayan guha
Super, it worked. Thanks On Fri, May 15, 2015 at 12:26 AM, Ram Sriharsha sriharsha@gmail.com wrote: Here is an example of how I would pass in the S3 parameters to hadoop configuration in pyspark. You can do something similar for other parameters you want to pass to the hadoop

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Paolo Platter
Nice Job! We are developing something very similar... I will contact you to see whether we can contribute some pieces! Best Paolo From: Evo Eftimov evo.efti...@isecc.com Sent: Thursday, May 14, 2015 17:21 To: 'David Morales' dmora...@stratio.com, Matei

RE: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Evo Eftimov
I do not intend to provide comments on the actual “product” since my time is engaged elsewhere. My comments were on the “process” of commenting, which looked like self-indulgent, self-congratulatory communication (between members of the party and its party leader) – that bs used to be

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread David Morales
We put a lot of work into sparkta and it is awesome to hear from both the community and relevant people. Just as easy as that. I hope you have time to consider the project, which is our main concern at this moment, and to hear from you too. 2015-05-14 17:46 GMT+02:00 Evo Eftimov

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Matei Zaharia
(Sorry, for non-English people: that means it's a good thing.) Matei On May 14, 2015, at 10:53 AM, Matei Zaharia matei.zaha...@gmail.com wrote: ...This is madness! On May 14, 2015, at 9:31 AM, dmoralesdf dmora...@stratio.com wrote: Hi there, We have released our real-time

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread David Morales
Thank you Paolo. Don't hesitate to contact us. Evo, we will be glad to hear from you and we are happy to see some kind of fast feedback from the main thought leader of spark, for sure. 2015-05-14 17:24 GMT+02:00 Paolo Platter paolo.plat...@agilelab.it: Nice Job! we are developing

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread David Morales
Thanks for your kind words Matei, happy to see that our work is headed the right way. 2015-05-14 17:10 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com: (Sorry, for non-English people: that means it's a good thing.) Matei On May 14, 2015, at 10:53 AM, Matei Zaharia matei.zaha...@gmail.com

Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Erich Ess
Hi, Is it possible to set up streams from multiple Kinesis streams and process them in a single job? From what I have read, this should be possible; however, the Kinesis layer errors out whenever I try to receive from more than a single Kinesis stream. Here is the code. Currently, I am focused

Storing a lot of state with updateStateByKey

2015-05-14 Thread krot.vyacheslav
Hi all, I'm a complete newbie to spark and spark streaming, so the question may seem obvious, sorry for that. Is it okay to store Seq[Data] in state when using 'updateStateByKey'? I have a function with signature def saveState(values: Seq[Msg], value: Option[Iterable[Msg]]): Option[Iterable[Msg]]
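
A minimal Scala sketch of collection-valued state with this signature (the Msg fields, the keyed msgStream, and the 1000-element cap are assumptions; without some cap, per-key state grows without bound):

    case class Msg(id: Long, payload: String)

    // msgStream: DStream[(String, Msg)] (assumed); updateStateByKey also
    // requires ssc.checkpoint(...) to be set.
    def saveState(values: Seq[Msg], state: Option[Iterable[Msg]]): Option[Iterable[Msg]] = {
      val merged = state.getOrElse(Seq.empty[Msg]) ++ values
      Some(merged.toSeq.takeRight(1000))  // assumed cap so per-key state stays bounded
    }

    val updated = msgStream.updateStateByKey(saveState _)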

RE: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Evo Eftimov
That has been a really rapid “evaluation” of the “work” and its “direction”. From: David Morales [mailto:dmora...@stratio.com] Sent: Thursday, May 14, 2015 4:12 PM To: Matei Zaharia Cc: user@spark.apache.org Subject: Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

Per-machine configuration?

2015-05-14 Thread mj1200
Is it possible to configure each machine that Spark is using as a worker individually? For instance, setting the maximum number of cores to use for each machine individually, or the maximum memory, or other settings related to workers? Or is there any other way to specify a per-machine capacity
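
For standalone mode this is per-machine by nature: each worker reads its own conf/spark-env.sh, so each box can advertise its own capacity. A sketch, assuming standalone mode, with placeholder values:

    # conf/spark-env.sh on one specific worker machine (values differ per box)
    export SPARK_WORKER_CORES=8       # cores this worker offers
    export SPARK_WORKER_MEMORY=24g    # memory this worker offers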

Hive partition table + read using hiveContext + spark 1.3.1

2015-05-14 Thread SamyaMaiti
Hi Team, I have a Hive partitioned table whose partition column contains spaces. When I try to run any query, say a simple Select * from table_name, it fails. *Please note the same was working in Spark 1.2.0; now I have upgraded to 1.3.1. Also there is no change in my application code base.* If I

Restricting the number of iterations in Mllib Kmeans

2015-05-14 Thread Suman Somasundar
Hi, I want to run a definite number of iterations in KMeans. There is a command line argument to set maxIterations, but even if I set it to a number, KMeans runs until the centroids converge. Is there a specific way to specify it on the command line? Also, I wanted to know if we can supply
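
For reference, a sketch of setting the bound programmatically; if the example app's command-line flag isn't being picked up, setting it in code is one way to check. Note that maxIterations is an upper bound: MLlib's KMeans still stops early once centroid movement falls below its convergence tolerance.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.textFile("kmeans_data.txt")   // placeholder path
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // Runs at most 20 Lloyd iterations for k = 10.
    val model = new KMeans().setK(10).setMaxIterations(20).run(points)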

Custom Aggregate Function for DataFrame

2015-05-14 Thread Justin Yip
Hello, May I know if there is a way to implement an aggregate function for grouped data in a DataFrame? I dug into the docs but didn't find anything apart from the UDF functions, which apply on a Row. Maybe I have missed something. Thanks. Justin
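
At the time of this thread (Spark 1.3/1.4) there is no public UDAF hook in the DataFrame API, so one common workaround is to drop to the RDD level for the grouped aggregation. A sketch, assuming a frame with a String "key" column and a Double "value" column, with a mean as the stand-in custom aggregate:

    import org.apache.spark.sql.Row

    // df: DataFrame with columns "key": String and "value": Double (assumed).
    val means = df.rdd
      .map { case Row(k: String, v: Double) => (k, v) }       // assumes non-null columns
      .aggregateByKey((0.0, 0L))(
        { case ((s, c), v) => (s + v, c + 1) },               // fold a value into (sum, count)
        { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) })  // merge partials
      .mapValues { case (s, c) => s / c }                     // the custom aggregate: mean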

Re: DStream Union vs. StreamingContext Union

2015-05-14 Thread Vadim Bichutskiy
@TD How do I file a JIRA? On Tue, May 12, 2015 at 2:06 PM, Tathagata Das tathagata.das1...@gmail.com wrote: I wonder that may be a bug in the Python API. Please file it as a JIRA along with sample code to reproduce it and sample output you get. On Tue, May 12, 2015 at 10:00 AM, Vadim

textFileStream Question

2015-05-14 Thread Vadim Bichutskiy
How does textFileStream work behind the scenes? How does Spark Streaming know what files are new and need to be processed? Is it based on time stamp, file name? Thanks, Vadim
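
As the reply further down this digest notes, it is based on the file timestamp: textFileStream is just fileStream with a default path filter, selecting files whose modification time falls into the current batch (and, by default, only files newer than the stream's start). A roughly equivalent expanded form, with a placeholder directory:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // What ssc.textFileStream(dir) does under the hood: pick up files in dir
    // by modification time, keeping only new files (newFilesOnly = true).
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
        "hdfs:///incoming", (p: Path) => true, true).map(_._2.toString)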

spark log field clarification

2015-05-14 Thread yanwei
I am trying to extract the *output data size* information for *each task*. What *field(s)* should I look for, given the json-format log? Also, what does Result Size stand for? Thanks a lot in advance! -Yanwei

Re: store hive metastore on persistent store

2015-05-14 Thread Tamas Jambor
I have tried putting the hive-site.xml file in the conf/ directory, but it seems it is not being picked up from there. On Thu, May 14, 2015 at 6:50 PM, Michael Armbrust mich...@databricks.com wrote: You can configure Spark SQL's Hive interaction by placing a hive-site.xml file in the conf/ directory.

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Tathagata Das
What is the error you are seeing? TD On Thu, May 14, 2015 at 9:00 AM, Erich Ess er...@simplerelevance.com wrote: Hi, Is it possible to setup streams from multiple Kinesis streams and process them in a single job? From what I have read, this should be possible, however, the Kinesis layer

Data Load - Newbie

2015-05-14 Thread Ricardo Goncalves da Silva
Hi, I have a text dataset to which I want to apply a clustering algorithm from MLlib. This data is an (n,m) matrix with headers. I would like to know if the team can help me with how to load this data in Spark Scala and separate the variable I want to cluster. Thanks Rick.

LogisticRegressionWithLBFGS with large feature set

2015-05-14 Thread Pala M Muthaia
Hi, I am trying to validate our modeling data pipeline by running LogisticRegressionWithLBFGS on a dataset with ~3.7 million features, basically to compute AUC. This is on Spark 1.3.0. I am using 128 executors with 4 GB each + a driver with 8 GB. The number of data partitions is 3072. The

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Chris Fregly
have you tried to union the 2 streams per the KinesisWordCountASL example https://github.com/apache/spark/blob/branch-1.3/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L120 where 2 streams (against the same Kinesis stream in this case) are created
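
The shape of that union pattern, sketched against the Spark 1.3 kinesis-asl API (stream name, endpoint, intervals, and receiver count are placeholders):

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.kinesis.KinesisUtils

    val numReceivers = 2  // e.g. one per shard
    val streams = (0 until numReceivers).map { _ =>
      KinesisUtils.createStream(ssc, "my-stream", "kinesis.us-east-1.amazonaws.com",
        Seconds(2), InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
    }
    val unified = ssc.union(streams)   // one DStream over all receivers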

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Tathagata Das
A possible problem may be that the kinesis stream in 1.3 uses the SparkContext app name as the Kinesis Application Name, which is used by the Kinesis Client Library to save checkpoints in DynamoDB. Since both Kinesis DStreams are using the same Kinesis application name (as they are in the same

Spark Summit 2015 - June 15-17 - Dev list invite

2015-05-14 Thread Scott walent
Join the Apache Spark community at the fourth Spark Summit in San Francisco on June 15, 2015. At Spark Summit 2015 you will hear keynotes from NASA, the CIA, Toyota, Databricks, AWS, Intel, MapR, IBM, Cloudera, Hortonworks, Timeful, O'Reilly, and Andreessen Horowitz. 260 talk proposals were

[SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-14 Thread Haopu Wang
In my application, I want to start a DStream computation only after a special event has happened (for example, I want to start the receiver only after the reference data has been properly initialized). My question is: it looks like the DStream will be started right after the StreamingContext has

RE: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-14 Thread Haopu Wang
Thank you, should I open a JIRA for this issue? From: Olivier Girardot [mailto:ssab...@gmail.com] Sent: Tuesday, May 12, 2015 5:12 AM To: Reynold Xin Cc: Haopu Wang; user Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable? I'll look into it

RE: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-05-14 Thread Haopu Wang
Hi TD, regarding the performance of updateStateByKey, do you have a JIRA for that so we can watch it? Thank you! From: Tathagata Das [mailto:t...@databricks.com] Sent: Wednesday, April 15, 2015 8:09 AM To: Krzysztof Zarzycki Cc: user Subject: Re: Is it

Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-14 Thread Anton Brazhnyk
Greetings, I have a relatively complex application with Spark, Jetty and Guava (16) not fitting together. Exception happens when some components try to use mix of Guava classes (including Spark's pieces) that are loaded by different classloaders: java.lang.LinkageError: loader constraint

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-14 Thread Marcelo Vanzin
What version of Spark are you using? The bug you mention is only about the Optional class (and a handful of others, but none of the classes you're having problems with). All other Guava classes should be shaded since Spark 1.2, so you should be able to use your own version of Guava with no

Re: Spark performance in cluster mode using yarn

2015-05-14 Thread ayan guha
With this information it is hard to predict. What performance are you getting? What is your desired performance? Maybe you can post your code so experts can suggest improvements? On 14 May 2015 15:02, sachin Singh sachin.sha...@gmail.com wrote: Hi Friends, please someone can give the

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Erich Ess
Hi Tathagata, I think that's exactly what's happening. The error message is: com.amazonaws.services.kinesis.model.InvalidArgumentException: StartingSequenceNumber 49550673839151225431779125105915140284622031848663416866 used in GetShardIterator on shard shardId-0002 in stream erich-test

RE: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-14 Thread Anton Brazhnyk
The problem is with 1.3.1 It has Function class (mentioned in exception) in spark-network-common_2.10-1.3.1.jar. Our current resolution is actually backport to 1.2.2, which is working fine. From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Thursday, May 14, 2015 6:27 PM To: Anton Brazhnyk

Re: Using sc.HadoopConfiguration in Python

2015-05-14 Thread ayan guha
Thanks for the reply, but _jsc does not seem to have anything to pass hadoop configs. Can you illustrate your answer a bit more? TIA... On Wed, May 13, 2015 at 12:08 AM, Ram Sriharsha sriharsha@gmail.com wrote: yes, the SparkContext in the Python API has a reference to the JavaSparkContext

Spark 1.3.0 - 1.3.1 produces java.lang.NoSuchFieldError: NO_FILTER

2015-05-14 Thread Exie
Hello Bright Sparks, I was using Spark 1.3.0 to push data out to Parquet files. They have been working great, super fast, easy way to persist data frames etc. However I just swapped out Spark 1.3.0 and picked up the tarball for 1.3.1. I unzipped it, copied my config over and then went to read

Re: spark sql hive-shims

2015-05-14 Thread Lior Chaga
After profiling with YourKit, I see there's an OutOfMemoryException in SQLContext.applySchema. Again, it's a very small RDD. Each executor has 180GB RAM. On Thu, May 14, 2015 at 8:53 AM, Lior Chaga lio...@taboola.com wrote: Hi, Using spark sql with HiveContext. Spark version is 1.3.1

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread Akhil Das
What do you mean by not detected? Maybe you forgot to trigger an action on the stream to get it executed. Like: val list_join_action_stream = ssc.fileStream[LongWritable, Text, TextInputFormat](gc.input_dir, (t: Path) => true, false).map(_._2.toString) *list_join_action_stream.count().print()*

Re: kafka + Spark Streaming with checkPointing fails to restart

2015-05-14 Thread NB
The data pipeline (DAG) should not be added to the StreamingContext in the case of a recovery scenario. The pipeline metadata is recovered from the checkpoint folder. That is one thing you will need to fix in your code. Also, I don't think the ssc.checkpoint(folder) call should be made in case of
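
A sketch of the recovery pattern NB describes: both the checkpoint call and the whole pipeline live inside the factory function, and getOrCreate only invokes the factory when no checkpoint exists (paths and intervals are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-app"   // placeholder

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("my-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)   // checkpoint set up inside the factory
      // ... build the entire Kafka/DStream pipeline here, inside the factory ...
      ssc
    }

    // Fresh start: runs createContext(). Restart: rebuilds the context and
    // the pipeline from the checkpoint and skips the factory entirely.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()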

Re: spark sql hive-shims

2015-05-14 Thread Lior Chaga
Ultimately it was PermGen out of memory. I somehow missed it in the log On Thu, May 14, 2015 at 9:24 AM, Lior Chaga lio...@taboola.com wrote: After profiling with YourKit, I see there's an OutOfMemoryException in context SQLContext.applySchema. Again, it's a very small RDD. Each executor has

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-14 Thread Akhil Das
Did you happen to have a look at the spark job server? https://github.com/ooyala/spark-jobserver Someone wrote a python wrapper https://github.com/wangqiang8511/spark_job_manager around it; give it a try. Thanks Best Regards On Thu, May 14, 2015 at 11:10 AM, MEETHU MATHEW

Re: Spark SQL: preferred syntax for column reference?

2015-05-14 Thread Michael Armbrust
Depends on which compile time you are talking about. *scala compile time*: No, the information about which columns are available is usually coming from a file or an external database which may or may not be available to scalac. *query compile time*: While your program is running, but before any
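
For context, the three interchangeable column-reference styles the thread is weighing, in the Scala API (assuming a SQLContext named sqlContext and a DataFrame df are in scope):

    import org.apache.spark.sql.functions.col
    import sqlContext.implicits._   // enables the $"..." syntax

    df.select(df("age"))    // column bound to this specific DataFrame
    df.select(col("age"))   // free-standing column, resolved at query compile time
    df.select($"age")       // shorthand for col("age")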

Re: spark-streaming whit flume error

2015-05-14 Thread Akhil Das
Can you share the client code that you used to send the data? May be this discussion would give you some insights http://apache-avro.679487.n3.nabble.com/Avro-RPC-Python-to-Java-isn-t-working-for-me-td4027454.html Thanks Best Regards On Thu, May 14, 2015 at 8:44 AM, 鹰 980548...@qq.com wrote:

RE: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread Saurabh Agrawal
How do I unsubscribe from this mailing list please? Thanks!! Regards, Saurabh Agrawal Vice President Markit Green Boulevard B-9A, Tower C 3rd Floor, Sector - 62, Noida 201301, India +91 120 611 8274 Office This e-mail, including accompanying communications

swap tuple

2015-05-14 Thread Yasemin Kaya
Hi, I have a *JavaPairRDD<String, String>* and I want to *swap tuple._1() and tuple._2()*. I use *tuple.swap()* but the JavaPairRDD is not actually changed. When I print the JavaPairRDD, the values are the same. Can anyone help me with that? Thank you. Have a nice day. yasemin -- hiç ender hiç

Build change PSA: Hadoop 2.2 default; -Phadoop-x.y profile recommended for builds

2015-05-14 Thread Sean Owen
This change will be merged shortly for Spark 1.4, and has a minor implication for those creating their own Spark builds: https://issues.apache.org/jira/browse/SPARK-7249 https://github.com/apache/spark/pull/5786 The default Hadoop dependency has actually been Hadoop 2.2 for some time, but the

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-14 Thread Tathagata Das
It would be good if you could tell me what I should add to the documentation to make it easier to understand. I can update the docs for the 1.4.0 release. On Tue, May 12, 2015 at 9:52 AM, Lee McFadden splee...@gmail.com wrote: Thanks for explaining Sean and Cody, this makes sense now. I'd like to help

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread Akhil Das
That's because you are using TextInputFormat, I think; try LzoTextInputFormat like: val list_join_action_stream = ssc.fileStream[LongWritable, Text, com.hadoop.mapreduce.LzoTextInputFormat](gc.input_dir, (t: Path) => true, false).map(_._2.toString) Thanks Best Regards On Thu, May 14, 2015 at

Re: Unsubscribe

2015-05-14 Thread Akhil Das
Have a look https://spark.apache.org/community.html Send an email to user-unsubscr...@spark.apache.org Thanks Best Regards On Thu, May 14, 2015 at 1:08 PM, Saurabh Agrawal saurabh.agra...@markit.com wrote: How do I unsubscribe from this mailing list please? Thanks!! Regards,

Re: Unsubscribe

2015-05-14 Thread Ted Yu
Please see http://spark.apache.org/community.html Cheers On May 14, 2015, at 12:38 AM, Saurabh Agrawal saurabh.agra...@markit.com wrote: How do I unsubscribe from this mailing list please? Thanks!! Regards, Saurabh Agrawal Vice President Markit Green Boulevard B-9A,

[Unit Test Failure] Test org.apache.spark.streaming.JavaAPISuite.testCount failed

2015-05-14 Thread kf
Hi all, I got the following error when I ran the unit tests of Spark via dev/run-tests on the latest branch-1.4. the latest commit id: commit d518c0369fa412567855980c3f0f426cde5c190d Author: zsxwing zsxw...@gmail.com Date: Wed May 13 17:58:29 2015 -0700 error [info] Test

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread Akhil Das
Here's https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java the class. You can read more here https://github.com/twitter/hadoop-lzo#maven-repository Thanks Best Regards On Thu, May 14, 2015 at 1:22 PM, lisendong lisend...@163.com wrote:

Re: The explanation of input text format using LDA in Spark

2015-05-14 Thread Cui xp
hi keegan, Thanks a lot. Now I know the column represents all the words, without repetition, across all documents. But I don't know what determines the order of the words; is there any difference when the column words are in a different order? Thanks.

Re: kafka + Spark Streaming with checkPointing fails to restart

2015-05-14 Thread Ankur Chauhan
Thanks everyone, that was the problem. The create-new-streaming-context function was supposed to set up the stream processing as well as the checkpoint directory. I had missed the whole checkpoint setup process. With that done, everything works as

RE: [Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-14 Thread Ishwardeep Singh
Hi Michael Ayan, Thank you for your response to my problem. Michael do we have a tentative release date for Spark version 1.4? Regards, Ishwardeep From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Wednesday, May 13, 2015 10:54 PM To: ayan guha Cc: Ishwardeep Singh; user Subject:

Re: how to set random seed

2015-05-14 Thread ayan guha
Sorry for the late reply. Here is what I was thinking:

    import random as r

    def main():
        # get SparkContext
        # Just for fun, let's assume the seed is an id
        filename = "bin.dat"
        seed = id(filename)
        # broadcast it
        br = sc.broadcast(seed)
        # set up dummy list
        lst = []
        for i in range(4):

how to delete data from table in sparksql

2015-05-14 Thread luohui20001
Hi guys, I need to delete some data from a table with delete from table where name = xxx, however delete is not functioning like the DML operation in Hive. I got info like below: Usage: delete [FILE|JAR|ARCHIVE] value [value]* 15/05/14 18:18:24 ERROR processors.DeleteResourceProcessor:

Re: [Unit Test Failure] Test org.apache.spark.streaming.JavaAPISuite.testCount failed

2015-05-14 Thread Tathagata Das
Do you get this failure repeatedly? On Thu, May 14, 2015 at 12:55 AM, kf wangf...@huawei.com wrote: Hi, all, i got following error when i run unit test of spark by dev/run-tests on the latest branch-1.4 branch. the latest commit id: commit d518c0369fa412567855980c3f0f426cde5c190d

Word2Vec with billion-word corpora

2015-05-14 Thread shilad
Hi all, I'm experimenting with Spark's Word2Vec implementation for a relatively large (5B words, vocabulary size 4M, 400-dimensional vectors) corpora. Has anybody had success running it at this scale? Thanks in advance for your guidance! -Shilad

question about sparksql caching

2015-05-14 Thread sequoiadb
Hi all, We are planning to use SparkSQL in a DW system. There's a question about the caching mechanism of SparkSQL. For example, if I have a SQL like sqlContext.sql("select c1, sum(c2) from T1, T2 where T1.key=T2.key group by c1").cache(), is it going to cache the final result or the raw data
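
For what it's worth, the two behaviors in question can be requested separately; a sketch (caching the result DataFrame is lazy, populated on first materialization, while cacheTable caches the source table's rows):

    // Caches the *result* of the aggregation: the cached DataFrame is
    // populated lazily the first time it is materialized.
    val agg = sqlContext.sql(
      "select c1, sum(c2) from T1, T2 where T1.key = T2.key group by c1").cache()

    // Caches the *raw* rows of a source table in an in-memory columnar
    // format, so any later query touching T1 reads from memory.
    sqlContext.cacheTable("T1")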

Re: textFileStream Question

2015-05-14 Thread 董帅阳
file timestamp -- Original Message -- From: Vadim Bichutskiy vadim.bichuts...@gmail.com Sent: Friday, May 15, 2015, 4:55 AM To: user@spark.apache.org Subject: textFileStream Question How does textFileStream work behind the scenes? How does Spark Streaming

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Chris Fregly
another option (not really recommended, but worth mentioning) would be to change the region of dynamodb to be separate from the other stream - and even separate from the stream itself. this isn't available right now, but will be in Spark 1.4. On May 14, 2015, at 6:47 PM, Erich Ess