Re: Equally split a RDD partition into two partition at the same node

2017-01-14 Thread Rishi Yadav
Can you provide some more details:
1. How many partitions does the RDD have?
2. How big is the cluster?
On Sat, Jan 14, 2017 at 3:59 PM Fei Hu  wrote:

> Dear all,
>
> I want to equally divide an RDD partition into two partitions. That means
> the first half of the elements in the partition will create a new partition,
> and the second half of the elements will generate another new
> partition. But the two new partitions are required to be on the same node
> as their parent partition, which helps achieve high data locality.
>
> Is there anyone who knows how to implement it or any hints for it?
>
> Thanks in advance,
> Fei
>
>
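
For reference, a rough and untested sketch of one possible approach: a custom
RDD that creates two child partitions per parent partition, declares a
NarrowDependency back to the parent, and reports the parent partition's
preferred locations so the scheduler keeps both halves on the same node. The
names SplitInTwoRDD and HalfPartition are made up for illustration, and
compute buffers one parent partition in memory to find its midpoint.

import scala.reflect.ClassTag
import org.apache.spark.{NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// each child partition covers one half of a parent partition
case class HalfPartition(index: Int, parent: Partition, firstHalf: Boolean) extends Partition

class SplitInTwoRDD[T: ClassTag](parent: RDD[T])
  extends RDD[T](parent.sparkContext, Seq(new NarrowDependency[T](parent) {
    // child partitions 2*i and 2*i + 1 both depend on parent partition i
    override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
  })) {

  override protected def getPartitions: Array[Partition] =
    parent.partitions.flatMap { p =>
      Seq[Partition](HalfPartition(2 * p.index, p, firstHalf = true),
                     HalfPartition(2 * p.index + 1, p, firstHalf = false))
    }

  // keep each child on the same node(s) as its parent partition for data locality
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    parent.preferredLocations(split.asInstanceOf[HalfPartition].parent)

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val hp = split.asInstanceOf[HalfPartition]
    val elems = parent.iterator(hp.parent, context).toArray   // buffers one parent partition
    val half = (elems.length + 1) / 2
    if (hp.firstHalf) elems.iterator.take(half) else elems.iterator.drop(half)
  }
}

// usage: val doubled = new SplitInTwoRDD(originalRdd)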


Re: Error: Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

2015-08-16 Thread Rishi Yadav
Try adding the HBase jar with --jars when you submit, rather than expecting --class to pull it in.
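
For example, a sketch based on the spark-submit command quoted below (the HBase jar paths are assumptions):

 ~/Desktop/spark/bin/spark-submit --class org.example.main.scalaConsumer \
   --jars /path/to/hbase-common.jar,/path/to/hbase-client.jar,/path/to/hbase-protocol.jar \
   scalConsumer-0.0.1-SNAPSHOT.jar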



On Fri, Aug 14, 2015 at 6:19 AM, Stephen Boesch java...@gmail.com wrote:

 The NoClassDefFoundError differs from ClassNotFoundException: it
 indicates an error while initializing that class, but the class is found on
 the classpath. Please provide the full stack trace.

 2015-08-14 4:59 GMT-07:00 stelsavva stel...@avocarrot.com:

 Hello, I am just starting out with Spark Streaming and HBase/Hadoop. I'm
 writing a simple app to read from Kafka and store to HBase, and I am having
 trouble submitting my job to Spark.

 I've downloaded Apache Spark 1.4.1 pre-built for Hadoop 2.6

 I am building the project with mvn package

 and submitting the jar file with

  ~/Desktop/spark/bin/spark-submit --class org.example.main.scalaConsumer
 scalConsumer-0.0.1-SNAPSHOT.jar

 And then I am getting the error you see in the subject line. Is this a
 problem with my Maven dependencies? Do I need to install Hadoop locally?
 And
 if so, how can I add the Hadoop classpath to the Spark job?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Error-Exception-in-thread-main-java-lang-NoClassDefFoundError-org-apache-hadoop-hbase-HBaseConfiguran-tp24266.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Spark can't fetch application jar after adding it to HTTP server

2015-08-16 Thread Rishi Yadav
Can you tell us more about your environment? I understand you are running it
on a single machine, but is a firewall enabled?

On Sun, Aug 16, 2015 at 5:47 AM, t4ng0 manvendra.tom...@gmail.com wrote:

 Hi

 I am new to Spark and trying to run a standalone application using
 spark-submit. From what I could understand from the logs, Spark can't
 fetch the jar file after adding it to the HTTP server. Do I need to
 configure proxy settings for Spark individually if that is the problem?
 Otherwise please help me, thanks in advance.

 PS: I am attaching the logs here.

  Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties 15/08/16 15:20:52 INFO
 SparkContext: Running Spark version 1.4.1 15/08/16 15:20:53 WARN
 NativeCodeLoader: Unable to load native-hadoop library for your platform...
 using builtin-java classes where applicable 15/08/16 15:20:53 INFO
 SecurityManager: Changing view acls to: manvendratomar 15/08/16 15:20:53
 INFO SecurityManager: Changing modify acls to: manvendratomar 15/08/16
 15:20:53 INFO SecurityManager: SecurityManager: authentication disabled; ui
 acls disabled; users with view permissions: Set(manvendratomar); users with
 modify permissions: Set(manvendratomar) 15/08/16 15:20:53 INFO Slf4jLogger:
 Slf4jLogger started 15/08/16 15:20:53 INFO Remoting: Starting remoting
 15/08/16 15:20:54 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkDriver@10.133.201.51:63985] 15/08/16 15:20:54 INFO
 Utils:
 Successfully started service 'sparkDriver' on port 63985. 15/08/16 15:20:54
 INFO SparkEnv: Registering MapOutputTracker 15/08/16 15:20:54 INFO
 SparkEnv:
 Registering BlockManagerMaster 15/08/16 15:20:54 INFO DiskBlockManager:
 Created local directory at

 /private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/blockmgr-ed753ee2-726e-45ae-97a2-0c82923262ef
 15/08/16 15:20:54 INFO MemoryStore: MemoryStore started with capacity 265.4
 MB 15/08/16 15:20:54 INFO HttpFileServer: HTTP File server directory is

 /private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/httpd-68c0afb2-92ef-4ddd-9ffc-639ccac81d6a
 15/08/16 15:20:54 INFO HttpServer: Starting HTTP Server 15/08/16 15:20:54
 INFO Utils: Successfully started service 'HTTP file server' on port 63986.
 15/08/16 15:20:54 INFO SparkEnv: Registering OutputCommitCoordinator
 15/08/16 15:20:54 INFO Utils: Successfully started service 'SparkUI' on
 port
 4040. 15/08/16 15:20:54 INFO SparkUI: Started SparkUI at
 http://10.133.201.51:4040 15/08/16 15:20:54 INFO SparkContext: Added JAR
 target/scala-2.11/spark_matrix_2.11-1.0.jar at
 http://10.133.201.51:63986/jars/spark_matrix_2.11-1.0.jar with timestamp
 1439718654603 15/08/16 15:20:54 INFO Executor: Starting executor ID driver
 on host localhost 15/08/16 15:20:54 INFO Utils: Successfully started
 service
 'org.apache.spark.network.netty.NettyBlockTransferService' on port 63987.
 15/08/16 15:20:54 INFO NettyBlockTransferService: Server created on 63987
 15/08/16 15:20:54 INFO BlockManagerMaster: Trying to register BlockManager
 15/08/16 15:20:54 INFO BlockManagerMasterEndpoint: Registering block
 manager
 localhost:63987 with 265.4 MB RAM, BlockManagerId(driver, localhost, 63987)
 15/08/16 15:20:54 INFO BlockManagerMaster: Registered BlockManager 15/08/16
 15:20:55 INFO MemoryStore: ensureFreeSpace(157248) called with curMem=0,
 maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block broadcast_0
 stored as values in memory (estimated size 153.6 KB, free 265.3 MB)
 15/08/16
 15:20:55 INFO MemoryStore: ensureFreeSpace(14257) called with
 curMem=157248,
 maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block
 broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free
 265.2 MB) 15/08/16 15:20:55 INFO BlockManagerInfo: Added broadcast_0_piece0
 in memory on localhost:63987 (size: 13.9 KB, free: 265.4 MB) 15/08/16
 15:20:55 INFO SparkContext: Created broadcast 0 from textFile at
 partition.scala:20 15/08/16 15:20:56 INFO FileInputFormat: Total input
 paths
 to process : 1 15/08/16 15:20:56 INFO SparkContext: Starting job: reduce at
 IndexedRowMatrix.scala:65 15/08/16 15:20:56 INFO DAGScheduler: Got job 0
 (reduce at IndexedRowMatrix.scala:65) with 1 output partitions
 (allowLocal=false) 15/08/16 15:20:56 INFO DAGScheduler: Final stage:
 ResultStage 0(reduce at IndexedRowMatrix.scala:65) 15/08/16 15:20:56 INFO
 DAGScheduler: Parents of final stage: List() 15/08/16 15:20:56 INFO
 DAGScheduler: Missing parents: List() 15/08/16 15:20:56 INFO DAGScheduler:
 Submitting ResultStage 0 (MapPartitionsRDD[6] at map at
 IndexedRowMatrix.scala:65), which has no missing parents 15/08/16 15:20:56
 INFO MemoryStore: ensureFreeSpace(4064) called with curMem=171505,
 maxMem=278302556 15/08/16 15:20:56 INFO MemoryStore: Block broadcast_1
 stored as values in memory (estimated size 4.0 KB, free 265.2 MB) 15/08/16
 15:20:56 INFO MemoryStore: ensureFreeSpace(2249) called with 

Re: Can't understand the size of raw RDD and its DataFrame

2015-08-15 Thread Rishi Yadav
Why are you expecting the footprint of the DataFrame to be lower when it
contains more information (RDD + schema)?

On Sat, Aug 15, 2015 at 6:35 PM, Todd bit1...@163.com wrote:

 Hi,
 With the following code snippet, I cached the raw RDD (which is already in
 memory, but just for illustration) and its DataFrame.
 I thought that the df cache would take less space than the rdd cache, which
 is wrong, because from the UI I see the rdd cache takes 168B while the
 df cache takes 272B.
 What data is cached when df.cache is called, and when does it actually cache the data?
 It looks like the df only cached the avg(age), which should be much smaller
 in size.

 val conf = new SparkConf().setMaster("local").setAppName("SparkSQL_Cache")
 val sc = new SparkContext(conf)
 val sqlContext = new SQLContext(sc)
 import sqlContext.implicits._
 val rdd = sc.parallelize(Array(Student("Jack", 21), Student("Mary", 22)))
 rdd.cache
 rdd.toDF().registerTempTable("TBL_STUDENT")
 val df = sqlContext.sql("select avg(age) from TBL_STUDENT")
 df.cache()
 df.show




Re: [MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Rishi Yadav
Can you explain which transformation is failing? Here's a simple example:

http://www.infoobjects.com/spark-calculating-correlation-using-rdd-of-vectors/
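
A minimal sketch of the RDD[Vector] input (assuming a SparkContext sc, e.g. in the shell); typing the RDD as RDD[Vector] rather than RDD[DenseVector] is what lets the RDD overload of corr apply:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// one Vector per observation; Vectors.dense already returns the Vector supertype
val rows: RDD[Vector] = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 33.0, 305.0)))

val corrMatrix = Statistics.corr(rows, "pearson")   // column-wise correlation Matrix
println(corrMatrix)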

On Thu, Jul 23, 2015 at 5:37 AM, saif.a.ell...@wellsfargo.com wrote:

  I tried with an RDD[DenseVector], but RDDs are invariant in their type
 parameter, so an RDD[DenseVector] is not an RDD[Vector], and I can't use the
 RDD input method of correlation.

 Thanks,
 Saif




Re: No suitable driver found for jdbc:mysql://

2015-07-22 Thread Rishi Yadav
Try setting --driver-class-path to the MySQL connector jar.
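
For example, a sketch of the command from the quoted message below with the driver classpath added:

$SPARK_HOME/bin/spark-submit \
  --driver-class-path /root/spark/lib/mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar \
  --jars /root/spark/lib/mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar \
  --conf spark.executor.extraClassPath=/root/spark/lib/mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar \
  --class saveBedToDB target/scala-2.10/adam-project_2.10-1.0.jar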

On Wed, Jul 22, 2015 at 3:45 PM, roni roni.epi...@gmail.com wrote:

 Hi All,
  I have a cluster with spark 1.4.
 I am trying to save data to mysql but getting error

 Exception in thread main java.sql.SQLException: No suitable driver found
 for jdbc:mysql://.rds.amazonaws.com:3306/DAE_kmer?user=password=


 *I looked at https://issues.apache.org/jira/browse/SPARK-8463 and added the
  connector jar to the same location as on the Master using the copy-dir script.*

 *But I am still getting the same error. This used to work with 1.3.*

 *This is my command  to run the program - **$SPARK_HOME/bin/spark-submit
 --jars
 /root/spark/lib/mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar
 --conf
 spark.executor.extraClassPath=/root/spark/lib/mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar
 --conf spark.executor.memory=55g --driver-memory=55g
  --master=spark://ec2-52-25-191-999.us-west-2.compute.amazonaws.com:7077  --class
 saveBedToDB  target/scala-2.10/adam-project_2.10-1.0.jar*

 *What else can I Do ?*

 *Thanks*

 *-Roni*



Re: How to use DataFrame with MySQL

2015-03-23 Thread Rishi Yadav
For me, it only works if I set --driver-class-path to the MySQL library.

On Sun, Mar 22, 2015 at 11:29 PM, gavin zhang gavin@gmail.com wrote:

 OK,I found what the problem is: It couldn't work with
 mysql-connector-5.0.8.
 I updated the connector version to 5.1.34 and it worked.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-DataFrame-with-MySQL-tp22178p22182.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Input validation for LogisticRegressionWithSGD

2015-03-15 Thread Rishi Yadav
Can you share some sample data?
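
If it helps narrow things down, a quick sanity check (a sketch, assuming `logistic` is the RDD[LabeledPoint] from the message below):

val labels = logistic.map(_.label).distinct().collect()
println(labels.mkString(", "))   // binary LogisticRegressionWithSGD expects only 0.0 and 1.0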

On Sun, Mar 15, 2015 at 8:51 PM, Rohit U rjupadhy...@gmail.com wrote:

 Hi,

 I am trying to run  LogisticRegressionWithSGD on RDD of LabeledPoints
 loaded using loadLibSVMFile:

 val logistic: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc,
  "s3n://logistic-regression/epsilon_normalized")

 val model = LogisticRegressionWithSGD.train(logistic, 100)

 It gives an input validation error after about 10 minutes:

 org.apache.spark.SparkException: Input validation failed.
 at
 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:162)
 at
 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
 at
 org.apache.spark.mllib.classification.LogisticRegressionWithSGD$.train(LogisticRegression.scala:157)
 at
 org.apache.spark.mllib.classification.LogisticRegressionWithSGD$.train(LogisticRegression.scala:192)

 From reading this bug report (
 https://issues.apache.org/jira/browse/SPARK-2575), since I am loading a
 LibSVM-format file there should be only 0/1 labels in the dataset, so I should
 not be facing the issue in the bug report. Is there something else I'm missing
 here?

 Thanks!



Re: Spark Release 1.3.0 DataFrame API

2015-03-14 Thread Rishi Yadav
Programmatically specifying the schema needs

 import org.apache.spark.sql.types._

for StructType and StructField to resolve.
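
A minimal sketch of the programmatic-schema path, assuming `people` is an RDD[String] of comma-separated "name,age" lines as in the guide:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", StringType, nullable = true)))

// createDataFrame expects an RDD[Row] plus the schema, not an RDD[String]
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")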

On Sat, Mar 14, 2015 at 10:07 AM, Sean Owen so...@cloudera.com wrote:

 Yes I think this was already just fixed by:

 https://github.com/apache/spark/pull/4977

 a .toDF() is missing

 On Sat, Mar 14, 2015 at 4:16 PM, Nick Pentreath
 nick.pentre...@gmail.com wrote:
  I've found people.toDF gives you a data frame (roughly equivalent to the
  previous Row RDD),
 
  And you can then call registerTempTable on that DataFrame.
 
   So people.toDF.registerTempTable("people") should work
 
 
 
  —
  Sent from Mailbox
 
 
  On Sat, Mar 14, 2015 at 5:33 PM, David Mitchell 
 jdavidmitch...@gmail.com
  wrote:
 
 
  I am pleased with the release of the DataFrame API.  However, I started
  playing with it, and neither of the two main examples in the
 documentation
  work: http://spark.apache.org/docs/1.3.0/sql-programming-guide.html
 
  Specifically:
 
  Inferring the Schema Using Reflection
  Programmatically Specifying the Schema
 
 
  Scala 2.11.6
  Spark 1.3.0 prebuilt for Hadoop 2.4 and later
 
  Inferring the Schema Using Reflection
  scala> people.registerTempTable("people")
  console:31: error: value registerTempTable is not a member of
  org.apache.spark.rdd.RDD[Person]
    people.registerTempTable("people")
           ^
 
  Programmatically Specifying the Schema
  scala> val peopleDataFrame = sqlContext.createDataFrame(people, schema)
  console:41: error: overloaded method value createDataFrame with alternatives:
    (rdd: org.apache.spark.api.java.JavaRDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame and
    (rdd: org.apache.spark.rdd.RDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame and
    (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row], columns: java.util.List[String])org.apache.spark.sql.DataFrame and
    (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame and
    (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
  cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.sql.types.StructType)
    val df = sqlContext.createDataFrame(people, schema)
 
  Any help would be appreciated.
 
  David
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Define size partitions

2015-01-30 Thread Rishi Yadav
If you are only concerned about partitions being too big, you can specify the
number of partitions as an additional parameter while loading files from HDFS.
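
For example (a sketch; the path is hypothetical, and the second argument is the minimum number of partitions):

val blocks = sc.textFile("hdfs:///data/fixed-record-files", 1000)
println(blocks.partitions.length)   // at least 1000 partitions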

On Fri, Jan 30, 2015 at 9:47 AM, Sven Krasser kras...@gmail.com wrote:

 You can also use your InputFormat/RecordReader in Spark, e.g. using
 newAPIHadoopFile. See here:
 https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
 .
 -Sven

 On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz konstt2...@gmail.com
 wrote:

 Hi,

 I want to process some files; they're kind of big, dozens of
 gigabytes each. I get them as an array of bytes, and there's a
 structure inside them.

 I have a header which describes the structure. It could be like:
 Number(8 bytes) Char(16 bytes) Number(4 bytes) Char(1 byte), ...
 This structure appears N times in the file.

 So, I know the size of each block since it's fixed. There's no
 separator between blocks.

 If I were doing this with MapReduce, I could implement a new
 RecordReader and InputFormat to read each block, because I know the
 size of them, and I'd fix the split size in the driver (block x 1000, for
 example). That way, I could be sure that each split for each mapper
 has complete blocks and there isn't a piece of the last block in the
 next split.

 Spark works with RDDs and partitions. How could I resize each
 partition to do that? Is it possible? I guess that Spark doesn't use
 the RecordReader and these classes for these tasks.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 http://sites.google.com/site/krasser/?utm_source=sig



RangePartitioner

2015-01-20 Thread Rishi Yadav
I am joining two tables as below; the program stalls at the log line below and
never proceeds.
What might be the issue and a possible solution?


 INFO SparkContext: Starting job: RangePartitioner at Exchange.scala:79


Table 1 has 450 columns
Table 2 has 100 columns


Both tables have a few million rows




            val table1 = myTable1.as('table1)
            val table2 = myTable2.as('table2)
            val results = table1.join(table2, LeftOuter,
              Some("table1.Id".attr === "table2.id".attr))




           println(results.count())

Thanks and Regards,
Rishi
@meditativesoul

Re: Problem with StreamingContext - getting SPARK-2243

2015-01-08 Thread Rishi Yadav
You can also access the SparkConf using sc.getConf in the Spark shell, though for
the StreamingContext you can directly refer to sc as Akhil suggested.
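
A minimal shell sketch of both points:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = sc.getConf                          // the SparkConf of the shell's SparkContext
val ssc = new StreamingContext(sc, Seconds(1)) // reuse sc instead of creating a new context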

On Sun, Dec 28, 2014 at 12:13 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 In the shell you could do:

 val ssc = new StreamingContext(sc, Seconds(1))

 as sc is the SparkContext, which is already instantiated.

 Thanks
 Best Regards

 On Sun, Dec 28, 2014 at 6:55 AM, Thomas Frisk tfris...@gmail.com wrote:

 Yes you are right - thanks for that :)

 On 27 December 2014 at 23:18, Ilya Ganelin ilgan...@gmail.com wrote:

 Are you trying to do this in the shell? Shell is instantiated with a
 spark context named sc.

 -Ilya Ganelin

 On Sat, Dec 27, 2014 at 5:24 PM, tfrisk tfris...@gmail.com wrote:


 Hi,

 Doing:
val ssc = new StreamingContext(conf, Seconds(1))

 and getting:
Only one SparkContext may be running in this JVM (see SPARK-2243). To
 ignore this error, set spark.driver.allowMultipleContexts = true.


 But I don't think that I have another SparkContext running. Is there any
 way I can check this or force kill it? I've tried restarting the server as
 I'm desperate, but I still get the same issue. I was not getting this
 earlier today.

 Any help much appreciated .

 Thanks,

 Thomas




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-StreamingContext-getting-SPARK-2243-tp20869.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: JavaRDD (Data Aggregation) based on key

2015-01-08 Thread Rishi Yadav
One approach is to first transform this RDD into a pair RDD, taking the
field you are going to aggregate on as the key.
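
A rough sketch in Scala (the Java API equivalent uses mapToPair and reduceByKey); the file name and field positions are assumptions:

import org.apache.spark.SparkContext._   // pair-RDD functions (needed on older Spark versions)

// CSV lines "a,b,c": key on field a (index 0), aggregate field c (index 2)
val pairs = sc.textFile("data.csv").map(_.split(",")).map(f => (f(0), f(2).toDouble))

val sums = pairs.reduceByKey(_ + _)                                  // sum per key
val avgs = pairs.mapValues(v => (v, 1L))
                .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
                .mapValues { case (s, n) => s / n }                  // average per key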

On Tue, Dec 23, 2014 at 1:47 AM, sachin Singh sachin.sha...@gmail.com
wrote:

 Hi,
 I have a csv file having fields a, b, c.
 I want to do aggregation (sum, average, ...) based on any field (a, b or c) as
 per user input,
 using the Apache Spark Java API. Please help, urgent!

 Thanks in advance,

 Regards
 Sachin



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/JavaRDD-Data-Aggregation-based-on-key-tp20828.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Profiling a spark application.

2015-01-08 Thread Rishi Yadav
As per my understanding, RDDs themselves do not get replicated; the underlying
data does if it's in HDFS.

On Thu, Dec 25, 2014 at 9:04 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:

 Hi,

 I want to find the time taken for replicating an rdd in spark cluster
 along with the computation time on the replicated rdd.

 Can someone please suggest some ideas?

 Thank you



Re: Implement customized Join for SparkSQL

2015-01-08 Thread Rishi Yadav
Hi Kevin,

Say A has 10 ids, so you are pulling data from B's data source only for
these 10 ids?

What if you load A and B as separate SchemaRDDs and then do the join? Spark
will optimize the path anyway when an action is fired.

On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin yun...@ebay.com wrote:

  Hi, All



 Suppose I want to join two tables A and B as follows:



 Select * from A join B on A.id = B.id



 A is a file while B is a database which indexed by id and I wrapped it by
 Data source API.

 The desired join flow is:

 1.   Generate A’s RDD[Row]

 2.   Generate B’s RDD[Row] from A by using A’s id and B’s data source
 api to get row from the database

 3.   Merge these two RDDs to the final RDD[Row]



 However, it seems the existing join strategy doesn't support this?



 Any way to achieve it?



 Best Regards,

 Kevin.



Re: sparkContext.textFile does not honour the minPartitions argument

2015-01-01 Thread Rishi Yadav
Hi Ankit,

The optional number-of-partitions value is there to increase the number of partitions,
not to reduce it below the default.
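
A small sketch of the usual workaround (the path is hypothetical): read normally, then coalesce down to one partition, since minPartitions only raises the count.

val single = sc.textFile("hdfs:///data/file.txt").coalesce(1)
println(single.partitions.length)   // 1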

On Thu, Jan 1, 2015 at 10:43 AM, Aniket Bhatnagar 
aniket.bhatna...@gmail.com wrote:

 I am trying to read a file into a single partition but it seems like
 sparkContext.textFile ignores the passed minPartitions value. I know I can
 repartition the RDD but I was curious to know if this is expected or if
 this is a bug that needs to be further investigated?


Re: Cached RDD

2014-12-30 Thread Rishi Yadav
Without caching, the lineage is recomputed for each action. So, assuming rdd2 and rdd3
are used in separate actions, the answer is yes.
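
A minimal sketch of the fix, with placeholder transformations:

val rdd1 = sc.textFile("input.txt").cache()   // hypothetical source
val rdd2 = rdd1.groupBy(_.take(1))
val rdd3 = rdd1.groupBy(_.length)
rdd2.count()   // computes rdd1 and populates the cache
rdd3.count()   // reuses the cached rdd1 instead of recomputing its lineage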

On Mon, Dec 29, 2014 at 7:53 PM, Corey Nolet cjno...@gmail.com wrote:

 If I have 2 RDDs which depend on the same RDD like the following:

 val rdd1 = ...

 val rdd2 = rdd1.groupBy()...

 val rdd3 = rdd1.groupBy()...


 If I don't cache rdd1, will its lineage be calculated twice (once for rdd2
 and once for rdd3)?



Re: reduceByKey and empty output files

2014-11-30 Thread Rishi Yadav
How big is your input dataset?

On Thursday, November 27, 2014, Praveen Sripati praveensrip...@gmail.com
wrote:

 Hi,

 When I run the program below, I see two files in HDFS because the
 number of partitions is 2. But one of the files is empty. Why is that? Is
 the work not distributed equally to all the tasks?

 textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).
 reduceByKey(lambda a, b: a+b).repartition(2)
 .saveAsTextFile("hdfs://localhost:9000/user/praveen/output/")

 Thanks,
 Praveen



-- 
- Rishi


Re: optimize multiple filter operations

2014-11-28 Thread Rishi Yadav
You can try (Scala version; you can convert it to Python):

val set = initial.groupBy(x => if (x == something) "key1" else "key2")

This would do one pass over the original data.
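
And a sketch of pulling the two groups back out of that single grouped pass:

val grouped = initial.groupBy(x => if (x == something) "key1" else "key2")
val set1 = grouped.filter(_._1 == "key1").flatMap(_._2)
val set2 = grouped.filter(_._1 == "key2").flatMap(_._2)

Note that groupBy still shuffles the data, so it is worth measuring whether it beats the two filter passes for your workload.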

On Fri, Nov 28, 2014 at 8:21 AM, mrm ma...@skimlinks.com wrote:

 Hi,

 My question is:

 I have multiple filter operations where I split my initial rdd into two
 different groups. The two groups cover the whole initial set. In code, it's
 something like:

 set1 = initial.filter(lambda x: x == something)
 set2 = initial.filter(lambda x: x != something)

 By doing this, I am doing two passes over the data. Is there any way to
 optimise this to do it in a single pass?

 Note: I was trying to look in the mailing list to see if this question has
 been asked already, but could not find it.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/optimize-multiple-filter-operations-tp20010.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-24 Thread Rishi Yadav
We keep conf as a symbolic link so that an upgrade is as simple as a drop-in
replacement.

On Monday, November 24, 2014, riginos samarasrigi...@gmail.com wrote:

 OK thank you very much for that!
  On 23 Nov 2014 21:49, Denny Lee [via Apache Spark User List] [hidden email] wrote:

 It sort of depends on your environment.  If you are running on your local
 environment, I would just download the latest Spark 1.1 binaries and you'll
 be good to go.  If its a production environment, it sort of depends on how
 you are setup (e.g. AWS, Cloudera, etc.)

  On Sun Nov 23 2014 at 11:27:49 AM riginos [hidden email] wrote:

 That was the problem ! Thank you Denny for your fast response!
 Another quick question:
 Is there any way to update spark to 1.1.0 fast?




 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Spark-SQL-Programming-Guide-
 registerTempTable-Error-tp19591p19595.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
  To unsubscribe, e-mail: [hidden email]
  For additional commands, e-mail: [hidden email]





 --
 View this message in context: Re: Spark SQL Programming Guide -
 registerTempTable Error
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Programming-Guide-registerTempTable-Error-tp19591p19638.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.



-- 
- Rishi


Re: Declaring multiple RDDs and efficiency concerns

2014-11-14 Thread Rishi Yadav
How about using the fluent style of Scala programming?
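
A rough sketch of what that could look like: wrap the steps in an implicit class (paste into the shell, or define it inside an object) so they chain fluently; the bodies below are placeholders for doSomething / doSomethingElse / doSomethingElseYet.

import org.apache.spark.rdd.RDD

implicit class MyPipeline(rdd: RDD[String]) {
  def function1: RDD[String] = rdd.filter(_.nonEmpty)   // placeholder body
  def function2: RDD[String] = rdd.map(_.trim)          // placeholder body
  def function3: RDD[String] = rdd.distinct()           // placeholder body
}

val rdd = myRdd.function1.function2.function3

Either way, the intermediate vals in the original version only hold lazy RDD handles, so by themselves they do not use extra storage.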


On Fri, Nov 14, 2014 at 8:31 AM, Simone Franzini captainfr...@gmail.com
wrote:

 Let's say I have to apply a complex sequence of operations to a certain
 RDD.
 In order to make code more modular/readable, I would typically have
 something like this:

 object myObject {
   def main(args: Array[String]) {
 val rdd1 = function1(myRdd)
 val rdd2 = function2(rdd1)
 val rdd3 = function3(rdd2)
   }

   def function1(rdd: RDD) : RDD = { doSomething }
   def function2(rdd: RDD) : RDD = { doSomethingElse }
   def function3(rdd: RDD) : RDD = { doSomethingElseYet }
 }

 So I am explicitly declaring vals for the intermediate steps. Does this
 end up using more storage than if I just chained all of the operations and
 declared only one val instead?
 If yes, is there a better way to chain together the operations?
 Ideally I would like to do something like:

 val rdd = function1.function2.function3

 Is there a way I can write the signature of my functions to accomplish
 this? Is this also an efficiency issue or just a stylistic one?

 Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini



Re: Assigning input files to spark partitions

2014-11-13 Thread Rishi Yadav
If your data is in HDFS and you are reading it with textFile, and each file is
smaller than the block size, my understanding is that you would always get one
partition per file.
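
A rough sketch of the one-RDD-per-file approach discussed below (the paths are hypothetical):

val files = Seq("hdfs:///data/part-a.txt", "hdfs:///data/part-b.txt")
val perFile = files.map(path => sc.textFile(path).coalesce(1))   // one partition per logical file
val combined = sc.union(perFile)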

On Thursday, November 13, 2014, Daniel Siegmann daniel.siegm...@velos.io
wrote:

 Would it make sense to read each file in as a separate RDD? This way you
 would be guaranteed the data is partitioned as you expected.

 Possibly you could then repartition each of those RDDs into a single
 partition and then union them. I think that would achieve what you expect.
 But it would be easy to accidentally screw this up (have some operation
 that causes a shuffle), so I think you're better off just leaving them as
 separate RDDs.

 On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia 
  mchett...@rocketfuelinc.com wrote:

 Hi,

 I have a set of input files for a spark program, with each file
 corresponding to a logical data partition. What is the API/mechanism to
 assign each input file (or a set of files) to a spark partition, when
 initializing RDDs?

 When i create a spark RDD pointing to the directory of files, my
 understanding is it's not guaranteed that each input file will be treated
 as separate partition.

 My job semantics require that the data is partitioned, and i want to
 leverage the partitioning that has already been done, rather than
 repartitioning again in the spark job.

 I tried to lookup online but haven't found any pointers so far.


 Thanks
 pala




 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 54 W 40th St, New York, NY 10018
  E: daniel.siegm...@velos.io W: www.velos.io



-- 
- Rishi


Re: Question about textFileStream

2014-11-12 Thread Rishi Yadav
Yes, you can always specify a minimum number of partitions and that would
force some parallelism (assuming you have enough cores).

On Wed, Nov 12, 2014 at 9:36 AM, Saiph Kappa saiph.ka...@gmail.com wrote:

 What if the window is of 5 seconds, and the file takes longer than 5
 seconds to be completely scanned? It will still attempt to load the whole
 file?

 On Mon, Nov 10, 2014 at 6:24 PM, Soumitra Kumar kumar.soumi...@gmail.com
 wrote:

 Entire file in a window.

 On Mon, Nov 10, 2014 at 9:20 AM, Saiph Kappa saiph.ka...@gmail.com
 wrote:

 Hi,

  In my application I am doing something like this: new
 StreamingContext(sparkConf, Seconds(10)).textFileStream("logs/"), and I
 get some unknown exceptions when I copy a file of about 800 MB to that
 folder (logs/). I have a single worker running with 512 MB of memory.

 Anyone can tell me if every 10 seconds spark reads parts of that big
 file, or if it attempts to read the entire file in a single window? How
 does it work?

 Thanks.






Re: join 2 tables

2014-11-12 Thread Rishi Yadav
Please use explicit JOIN ... ON syntax.
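
For example, a sketch of the same query from the quoted message written with JOIN ... ON:

val tmp2 = sqlContext.sql(
  "select a.ult_fecha, b.pri_fecha " +
  "from fecha_ult_compra_u3m a join fecha_pri_compra_u3m b on a.id = b.id")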

On Wed, Nov 12, 2014 at 8:57 AM, Franco Barrientos 
franco.barrien...@exalitica.com wrote:

 I have 2 tables in a Hive context, and I want to select one field from each
 table where the ids of the two tables are equal. For example,



 *val tmp2 = sqlContext.sql("select a.ult_fecha, b.pri_fecha from
 fecha_ult_compra_u3m as a, fecha_pri_compra_u3m as b where a.id = b.id")*



 but I get an error:





 *Franco Barrientos*
 Data Scientist

 Málaga #115, Of. 1003, Las Condes.
 Santiago, Chile.
 (+562)-29699649
 (+569)-76347893

 franco.barrien...@exalitica.com

 www.exalitica.com






Re: S3 table to spark sql

2014-11-11 Thread Rishi Yadav
Simple:

scala> val date = new
java.text.SimpleDateFormat("mmdd").parse(fechau3m)

should work. Replace "mmdd" with the format fechau3m is actually in.

If you want to do it at the case class level:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// HiveContext is always a good idea

import sqlContext.createSchemaRDD



case class trx_u3m(id: String, local: String, fechau3m: java.util.Date,
rubro: Int, sku: String, unidades: Double, monto: Double)



val tabla =
sc.textFile("s3n://exalitica.com/trx_u3m/trx_u3m.txt").map(_.split(",")).map(p
=> trx_u3m(p(0).trim.toString, p(1).trim.toString, new
java.text.SimpleDateFormat("mmdd").parse(p(2).trim.toString),
p(3).trim.toInt, p(4).trim.toString, p(5).trim.toDouble,
p(6).trim.toDouble))

tabla.registerTempTable("trx_u3m")


On Tue, Nov 11, 2014 at 11:11 AM, Franco Barrientos 
franco.barrien...@exalitica.com wrote:

 How can I create a date field in Spark SQL? I have an S3 table and I load
 it into an RDD.



 val sqlContext = new org.apache.spark.sql.SQLContext(sc)

 import sqlContext.createSchemaRDD



 case class trx_u3m(id: String, local: String, fechau3m: String, rubro:
 Int, sku: String, unidades: Double, monto: Double)



 val tabla =
 sc.textFile("s3n://exalitica.com/trx_u3m/trx_u3m.txt").map(_.split(",")).map(p
 => trx_u3m(p(0).trim.toString, p(1).trim.toString, p(2).trim.toString,
 p(3).trim.toInt, p(4).trim.toString, p(5).trim.toDouble,
 p(6).trim.toDouble))

 tabla.registerTempTable("trx_u3m")



 Now my problem is: how can I transform the string variable (fechau3m) into a
 date variable?



 *Franco Barrientos*
 Data Scientist

 Málaga #115, Of. 1003, Las Condes.
 Santiago, Chile.
 (+562)-29699649
 (+569)-76347893

 franco.barrien...@exalitica.com

 www.exalitica.com






Re: Spark SQL : how to find element where a field is in a given set

2014-11-02 Thread Rishi Yadav
did you create SQLContext?

On Sat, Nov 1, 2014 at 7:51 PM, abhinav chowdary abhinav.chowd...@gmail.com
 wrote:

 I have the same requirement of passing a list of values to an 'in' clause. When I am
 trying to do it,

 I am getting the error below:

 scala> val longList = Seq[Expression]("a", "b")
 console:11: error: type mismatch;
  found   : String("a")
  required: org.apache.spark.sql.catalyst.expressions.Expression
    val longList = Seq[Expression]("a", "b")

 Thanks


 On Fri, Aug 29, 2014 at 3:52 PM, Michael Armbrust mich...@databricks.com
 wrote:

 This feature was not part of that version.  It will be in 1.1.


 On Fri, Aug 29, 2014 at 12:33 PM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:


 1.0.2


 On Friday, August 29, 2014, Michael Armbrust mich...@databricks.com
 wrote:

 What version are you using?



 On Fri, Aug 29, 2014 at 2:22 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Still not working for me. I got a compilation error : *value in is
 not a member of Symbol.* Any ideas ?


 On Fri, Aug 29, 2014 at 9:46 AM, Michael Armbrust 
 mich...@databricks.com wrote:

 To pass a list to a variadic function you can use the type ascription
 :_*

 For example:

  val longList = Seq[Expression]("a", "b", ...)
  table("src").where('key in (longList: _*))

  Also, note that I had to explicitly specify Expression as the type
  parameter of Seq to ensure that the compiler converts "a" and "b" into
  Spark SQL expressions.




 On Thu, Aug 28, 2014 at 11:52 PM, Jaonary Rabarisoa 
 jaon...@gmail.com wrote:

  OK, but what if I have a long list? Do I need to hard-code every element of
  my list like this, or is there a function that translates a list into
  a tuple?


 On Fri, Aug 29, 2014 at 3:24 AM, Michael Armbrust 
 mich...@databricks.com wrote:

 You don't need the Seq, as in is a variadic function.

  personTable.where('name in ("foo", "bar"))



 On Thu, Aug 28, 2014 at 3:09 AM, Jaonary Rabarisoa 
 jaon...@gmail.com wrote:

 Hi all,

  What is the expression that I should use with the Spark SQL DSL if I
  need to retrieve
  data with a field in a given set.
  For example:

 I have the following schema

 case class Person(name: String, age: Int)

 And I need to do something like :

  personTable.where('name in Seq("foo", "bar")) ?


 Cheers.


 Jaonary












 --
 Warm Regards
 Abhinav Chowdary



Re: Bug in Accumulators...

2014-10-25 Thread Rishi Yadav
Works fine for me: Spark 1.1.0 in the REPL.
On Sat, Oct 25, 2014 at 1:41 PM, octavian.ganea octavian.ga...@inf.ethz.ch
wrote:

 There is for sure a bug in the Accumulators code.

 More specifically, the following code works well as expected:

    def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("EL LBP SPARK")
  val sc = new SparkContext(conf)
  val accum = sc.accumulator(0)
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
  sc.stop
    }

  but the following code (adding just a for loop) gives the weird error:
    def run(args: Array[String]) {
  val conf = new SparkConf().setAppName("EL LBP SPARK")
  val sc = new SparkContext(conf)
  val accum = sc.accumulator(0)
  for (i <- 1 to 10) {
    sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
  }
  sc.stop
    }


 the error:
 Exception in thread main org.apache.spark.SparkException: Job aborted due
 to stage failure: Task not serializable: java.io.NotSerializableException:
 org.apache.spark.SparkContext
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
 at

 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
 at

 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at

 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Can someone confirm this bug ?

 Related to this:

 http://apache-spark-user-list.1001560.n3.nabble.com/Accumulators-Task-not-serializable-java-io-NotSerializableException-org-apache-spark-SparkContext-td17262.html


 http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-using-Accumulators-on-cluster-td17261.html



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-tp17263.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread Rishi Yadav
Hi Tridib,

I changed SQLContext to HiveContext and it started working. These are the steps
I used.

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val person = sqlContext.jsonFile("json/person.json")
person.printSchema()
person.registerTempTable("person")
val address = sqlContext.jsonFile("json/address.json")
address.printSchema()
address.registerTempTable("address")
sqlContext.cacheTable("person")
sqlContext.cacheTable("address")
val rs2 = sqlContext.sql("select p.id, p.name, a.city from person p join
address a on (p.id = a.id)").collect.foreach(println)


Rishi@InfoObjects

*Pure-play Big Data Consulting*


On Tue, Oct 21, 2014 at 5:47 AM, tridib tridib.sama...@live.com wrote:

 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val personPath = "/hdd/spark/person.json"
 val person = sqlContext.jsonFile(personPath)
 person.printSchema()
 person.registerTempTable("person")
 val addressPath = "/hdd/spark/address.json"
 val address = sqlContext.jsonFile(addressPath)
 address.printSchema()
 address.registerTempTable("address")
 sqlContext.cacheTable("person")
 sqlContext.cacheTable("address")
 val rs2 = sqlContext.sql("SELECT p.id, p.name, a.city FROM person p, address
 a where p.id = a.id limit 10").collect.foreach(println)

 person.json
 {"id":1,"name":"Mr. X"}

 address.json
 {"city":"Earth","id":1}



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16914.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How to write a RDD into One Local Existing File?

2014-10-19 Thread Rishi Yadav
Write to HDFS and then get one file locally by using hdfs dfs -getmerge...

On Friday, October 17, 2014, Sean Owen so...@cloudera.com wrote:

 You can save to a local file. What are you trying and what doesn't work?

 You can output one file by repartitioning to 1 partition but this is
 probably not a good idea as you are bottlenecking the output and some
 upstream computation by disabling parallelism.

 How about just combining the files on HDFS afterwards? or just reading
 all the files instead of 1? You can hdfs dfs -cat a bunch of files at
 once.

 On Fri, Oct 17, 2014 at 6:46 PM, Parthus peng.wei@gmail.com
 javascript:; wrote:
  Hi,
 
  I have a spark mapreduce task which requires me to write the final rdd
 to an
  existing local file (appending to this file). I tried two ways but
 neither
  works well:
 
  1. use saveAsTextFile() api. Spark 1.1.0 claims that this API can write
 to
  local, but I never make it work. Moreover, the result is not one file
 but a
  series of part-x files which is not what I hope to get.
 
  2. collect the rdd to an array and write it to the driver node using
 Java's
  File IO. There are also two problems: 1) my RDD is huge(1TB), which
 cannot
  fit into the memory of one driver node. I have to split the task into
 small
  pieces and collect them part by part and write; 2) During the writing by
  Java IO, the Spark Mapreduce task has to wait, which is not efficient.
 
  Could anybody provide me an efficient way to solve this problem? I wish
 that
  the solution could be like: appending a huge rdd to a local file without
  pausing the MapReduce during writing?
 
 
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-tp16720.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org javascript:;
  For additional commands, e-mail: user-h...@spark.apache.org
 javascript:;
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org javascript:;
 For additional commands, e-mail: user-h...@spark.apache.org javascript:;



-- 
- Rishi


Re: Spark Streaming Twitter Example Error

2014-08-21 Thread Rishi Yadav
Please add the following three libraries to your classpath:

spark-streaming-twitter_2.10-1.0.0.jar
twitter4j-core-3.0.3.jar
twitter4j-stream-3.0.3.jar
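
Alternatively, a sketch for build.sbt (pulling in the streaming-twitter module; the version shown should match your Spark version):

libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.0.2"

You would still need to ship the transitive twitter4j jars to the cluster, e.g. via --jars or by building an assembly jar.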


On Thu, Aug 21, 2014 at 1:09 PM, danilopds danilob...@gmail.com wrote:

 Hi!

  I'm getting started with development in Spark Streaming, and I'm learning
  from the examples available in the Spark directory. There are several
  applications and I want to make modifications.

 I can execute the TwitterPopularTags normally with command:
 ./bin/run-example TwitterPopularTags auth

 So,
 I moved the source code to a separate folder with the structure:
 ./src/main/scala/

 With the files:
 -TwitterPopularTags
 -TwitterUtils
 -StreamingExamples
 -TwitterInputDStream

 But when I run the command:
 ./bin/spark-submit --class TwitterPopularTags --master local[4]
 /MY_DIR/TwitterTest/target/scala-2.10/simple-project_2.10-1.0.jar auth

 I receive the following error:
 Exception in thread main java.lang.NoClassDefFoundError:
 twitter4j/auth/Authorization
 at TwitterUtils$.createStream(TwitterUtils.scala:42)
 at TwitterPopularTags$.main(TwitterPopularTags.scala:65)
 at TwitterPopularTags.main(TwitterPopularTags.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException: twitter4j.auth.Authorization
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 10 more

 This is my sbt build file:
  name := "Simple Project"

  version := "1.0"

  scalaVersion := "2.10.4"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.2"

  libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.2"

  libraryDependencies += "org.twitter4j" % "twitter4j-core" % "3.0.3"

  libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3"

  resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

 Can anybody help me?
 Thanks a lot!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Twitter-Example-Error-tp12600.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org