Printing MLpipeline model in Python.

2016-03-14 Thread VISHNU SUBRAMANIAN
Hi All,

I am using Spark 1.6 and PySpark.

I am trying to build a Random Forest classifier model using the ML pipeline
API in Python.

When I print the model, I get the value below.

RandomForestClassificationModel (uid=rfc_be9d4f681b92) with 10 trees

When I use the MLlib RandomForest model with toDebugString, I get all the
rules used to build the model.


  Tree 0:
If (feature 53 <= 0.0)
 If (feature 49 <= 0.0)
  If (feature 3 <= 1741.0)
   If (feature 47 <= 0.0)
    Predict: 0.0
   Else (feature 47 > 0.0)
    Predict: 0.0
  Else (feature 3 > 1741.0)
   If (feature 47 <= 0.0)
    Predict: 1.0
   Else (feature 47 > 0.0)

How can I achieve the same thing using the ML pipeline model?

Thanks in Advance.

Vishnu
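One possible approach, shown as a sketch in Scala (assuming Spark 1.6; the name pipelineModel and the stage index are hypothetical and depend on how the pipeline was assembled): the ml RandomForestClassificationModel also exposes toDebugString, so extracting the stage from the fitted PipelineModel and printing it should give the per-tree rules. From PySpark the same method may need to be reached through the fitted model's underlying Java object.

import org.apache.spark.ml.classification.RandomForestClassificationModel

// pipelineModel: a fitted org.apache.spark.ml.PipelineModel (hypothetical name).
// The stage index (here 2) depends on how the pipeline was assembled.
val rfModel = pipelineModel
  .stages(2)
  .asInstanceOf[RandomForestClassificationModel]

// Prints the per-tree split rules, similar to MLlib's toDebugString output.
println(rfModel.toDebugString)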


Re: how to convert millisecond time to SQL timestamp

2016-02-01 Thread VISHNU SUBRAMANIAN
Hi,

If you need a DataFrame-specific solution, you can try the below:

df.select(from_unixtime(col("max(utcTimestamp)")/1000))
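A fuller sketch in Scala (assuming Spark 1.5+; the utcTimestamp column name and values are hypothetical). from_unixtime expects seconds, hence the division by 1000:

import org.apache.spark.sql.functions.{col, from_unixtime}

// Hypothetical data: epoch times in milliseconds.
val events = sqlContext.createDataFrame(Seq(
  (1456050620000L, 1),
  (1456050621000L, 2)
)).toDF("utcTimestamp", "id")

// Divide by 1000 because from_unixtime interprets its argument as seconds.
val withTs = events.select(
  from_unixtime(col("utcTimestamp") / 1000).alias("ts"),
  col("id"))

withTs.show(false)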

On Tue, 2 Feb 2016 at 09:44 Ted Yu  wrote:

> See related thread on using Joda DateTime:
> http://search-hadoop.com/m/q3RTtSfi342nveex1=RE+NPE+
> when+using+Joda+DateTime
>
> On Mon, Feb 1, 2016 at 7:44 PM, Kevin Mellott 
> wrote:
>
>> I've had pretty good success using Joda-Time
>>  for date/time manipulations
>> within Spark applications. You may be able to use the *DateTime* constructor
>> below, if you are starting with milliseconds.
>>
>> DateTime
>>
>> public DateTime(long instant)
>>
>> Constructs an instance set to the milliseconds from 1970-01-01T00:00:00Z
>> using ISOChronology in the default time zone.
>> Parameters: instant - the milliseconds from 1970-01-01T00:00:00Z
>>
>> On Mon, Feb 1, 2016 at 5:51 PM, Andy Davidson <
>> a...@santacruzintegration.com> wrote:
>>
>>> What little I know about working with timestamps is based on
>>> https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html
>>>
>>> Using the example of dates formatted into human-friendly strings ->
>>> timestamps, I was able to figure out how to convert epoch times to
>>> timestamps. The same trick did not work for millisecond times.
>>>
>>> Any suggestions would be greatly appreciated.
>>>
>>>
>>> Andy
>>>
>>> Working with epoch times
>>> 
>>>
>>> ref: http://www.epochconverter.com/
>>>
>>> Epoch timestamp: 1456050620
>>>
>>> Timestamp in milliseconds: 145605062
>>>
>>> Human time (GMT): Sun, 21 Feb 2016 10:30:20 GMT
>>>
>>> Human time (your time zone): 2/21/2016, 2:30:20 AM
>>>
>>>
>>> # Epoch time stamp example
>>>
>>> data = [
>>>
>>>   ("1456050620", "1456050621", 1),
>>>
>>>   ("1456050622", "14560506203", 2),
>>>
>>>   ("14560506204", "14560506205", 3)]
>>>
>>> df = sqlContext.createDataFrame(data, ["start_time", "end_time", "id"])
>>>
>>> ​
>>>
>>> # convert epoch time strings in to spark timestamps
>>>
>>> df = df.select(
>>>
>>>   df.start_time.cast("long").alias("start_time"),
>>>
>>>   df.end_time.cast("long").alias("end_time"),
>>>
>>>   df.id)
>>>
>>> df.printSchema()
>>>
>>> df.show(truncate=False)
>>>
>>> ​
>>>
>>> # convert longs to timestamps
>>>
>>> df = df.select(
>>>
>>>   df.start_time.cast("timestamp").alias("start_time"),
>>>
>>>   df.end_time.cast("timestamp").alias("end_time"),
>>>
>>>   df.id)
>>>
>>> df.printSchema()
>>>
>>> df.show(truncate=False)
>>>
>>> ​
>>>
>>> root
>>>  |-- start_time: long (nullable = true)
>>>  |-- end_time: long (nullable = true)
>>>  |-- id: long (nullable = true)
>>>
>>> +---+---+---+
>>> |start_time |end_time   |id |
>>> +---+---+---+
>>> |1456050620 |1456050621 |1  |
>>> |1456050622 |14560506203|2  |
>>> |14560506204|14560506205|3  |
>>> +---+---+---+
>>>
>>> root
>>>  |-- start_time: timestamp (nullable = true)
>>>  |-- end_time: timestamp (nullable = true)
>>>  |-- id: long (nullable = true)
>>>
>>> +-+-+---+
>>> |start_time   |end_time |id |
>>> +-+-+---+
>>> |2016-02-21 02:30:20.0|2016-02-21 02:30:21.0|1  |
>>> |2016-02-21 02:30:22.0|2431-05-28 02:03:23.0|2  |
>>> |2431-05-28 02:03:24.0|2431-05-28 02:03:25.0|3  |
>>> +-+-+---+
>>>
>>>
>>> In [21]:
>>>
>>> # working with millisecond times
>>>
>>> data = [
>>>
>>>   ("145605062", "145605062", 1)]
>>>
>>>   df = sqlContext.createDataFrame(data, ["start_time", "end_time", "id"])
>>>
>>> ​
>>>
>>> # convert epoch time strings in to spark timestamps
>>>
>>> df = df.select(
>>>
>>>   df.start_time.cast("long").alias("start_time"),
>>>
>>>   df.end_time.cast("long").alias("end_time"),
>>>
>>>   df.id)
>>>
>>> df.printSchema()
>>>
>>> df.show(truncate=False)
>>>
>>> ​
>>>
>>> # convert longs to timestamps
>>>
>>> df = df.select(
>>>
>>>   df.start_time.cast("timestamp").alias("start_time"),
>>>
>>>   df.end_time.cast("timestamp").alias("end_time"),
>>>
>>>   df.id)
>>>
>>> df.printSchema()
>>>
>>> df.show(truncate=False)
>>>
>>> root
>>>  |-- start_time: long (nullable = true)
>>>  |-- end_time: long (nullable = true)
>>>  |-- id: long (nullable = true)
>>>
>>> +-+-+---+
>>> |start_time   |end_time |id |
>>> +-+-+---+
>>> |145605062|145605062|1  |
>>> +-+-+---+
>>>
>>> root
>>>  |-- start_time: timestamp (nullable = true)
>>>  |-- end_time: timestamp (nullable = true)
>>>  |-- id: long (nullable = true)
>>>
>>> +--+--+---+
>>> |start_time|end_time  |id |
>>> +--+--+---+
>>> 

Re: How to accelerate reading json file?

2016-01-05 Thread VISHNU SUBRAMANIAN
Hi,

You can try this:

sqlContext.read.format("json").option("samplingRatio","0.1").load("path")

If it still takes time, feel free to experiment with the samplingRatio.
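If the schema is already known, a related sketch (hypothetical field names, assuming Spark 1.4+): supplying the schema up front skips the inference scan entirely.

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical schema matching the JSON records.
val schema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("age", LongType, true)
))

// With an explicit schema, Spark does not need to scan the files to infer one.
val people = sqlContext.read.schema(schema).json("path")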

Thanks,
Vishnu

On Wed, Jan 6, 2016 at 12:43 PM, Gavin Yue  wrote:

> I am trying to read json files following the example:
>
> val path = "examples/src/main/resources/jsonfile"
> val people = sqlContext.read.json(path)
>
> I have 1 TB of files in the path. It took 1.2 hours to finish reading them
> to infer the schema.
>
> But I already know the schema. Could I make this process short?
>
> Thanks a lot.
>
>
>
>


Re: custom schema in spark throwing error

2015-12-21 Thread VISHNU SUBRAMANIAN
Try this

val customSchema = StructType(Array(
StructField("year", IntegerType, true),
StructField("make", StringType, true),
StructField("model", StringType, true)
))

On Mon, Dec 21, 2015 at 8:26 AM, Divya Gehlot 
wrote:

>
> scala> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.hive.HiveContext
>
> scala> import org.apache.spark.sql.hive.orc._
> import org.apache.spark.sql.hive.orc._
>
> scala> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
>
> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> 15/12/21 02:06:24 WARN SparkConf: The configuration key
> 'spark.yarn.applicationMaster.waitTries' has been deprecated as of
> Spark 1.3 and may be removed in the future. Please use the new key
> 'spark.yarn.am.waitTime' instead.
> 15/12/21 02:06:24 INFO HiveContext: Initializing execution hive, version 0.13.1
> hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@74cba4b
>
> scala> val customSchema = StructType(Seq(StructField("year", IntegerType, true),
> StructField("make", StringType, true), StructField("model", StringType, true),
> StructField("comment", StringType, true), StructField("blank", StringType, true)))
> customSchema: org.apache.spark.sql.types.StructType = StructType(
> StructField(year,IntegerType,true), StructField(make,StringType,true),
> StructField(model,StringType,true), StructField(comment,StringType,true),
> StructField(blank,StringType,true))
>
> scala> val customSchema = (new StructType).add("year", IntegerType, true)
> .add("make", StringType, true).add("model", StringType, true)
> .add("comment", StringType, true).add("blank", StringType, true)
> <console>:24: error: not enough arguments for constructor StructType:
> (fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType.
> Unspecified value parameter fields.
>        val customSchema = (new StructType).add("year", IntegerType, true).add("make", StringType, true).add("model", StringType, true).add("comment", StringType, true).add("blank", StringType, true)
>
>


How VectorIndexer works in Spark ML pipelines

2015-10-15 Thread VISHNU SUBRAMANIAN
Hi All,

I am trying to use the VectorIndexer (feature extraction) technique
available in Spark ML Pipelines.

I ran the example in the documentation.

val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)


And then I wanted to see what output it generates.

After performing transform on the dataset, the output looks like the below.

scala> predictions.select("indexedFeatures").take(1).foreach(println)

[(692,[124,125,126,127,151,152,153,154,155,179,180,181,182,183,208,209,210,211,235,236,237,238,239,263,264,265,266,267,268,292,293,294,295,296,321,322,323,324,349,350,351,352,377,378,379,380,405,406,407,408,433,434,435,436,461,462,463,464,489,490,491,492,493,517,518,519,520,521,545,546,547,548,549,574,575,576,577,578,602,603,604,605,606,630,631,632,633,634,658,659,660,661,662],[145.0,255.0,211.0,31.0,32.0,237.0,253.0,252.0,71.0,11.0,175.0,253.0,252.0,71.0,144.0,253.0,252.0,71.0,16.0,191.0,253.0,252.0,71.0,26.0,221.0,253.0,252.0,124.0,31.0,125.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,170.0,253.0,252.0,252.0,252.0,42.0,149.0,252.0,252.0,252.0,144.0,109.0,252.0,252.0,252.0,144.0,218.0,253.0,253.0,255.0,35.0,175.0,252.0,252.0,253.0,35.0,73.0,252.0,252.0,253.0,35.0,31.0,211.0,252.0,253.0,35.0])]


scala> predictions.select("features").take(1).foreach(println)

[(692,[124,125,126,127,151,152,153,154,155,179,180,181,182,183,208,209,210,211,235,236,237,238,239,263,264,265,266,267,268,292,293,294,295,296,321,322,323,324,349,350,351,352,377,378,379,380,405,406,407,408,433,434,435,436,461,462,463,464,489,490,491,492,493,517,518,519,520,521,545,546,547,548,549,574,575,576,577,578,602,603,604,605,606,630,631,632,633,634,658,659,660,661,662],[145.0,255.0,211.0,31.0,32.0,237.0,253.0,252.0,71.0,11.0,175.0,253.0,252.0,71.0,144.0,253.0,252.0,71.0,16.0,191.0,253.0,252.0,71.0,26.0,221.0,253.0,252.0,124.0,31.0,125.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,170.0,253.0,252.0,252.0,252.0,42.0,149.0,252.0,252.0,252.0,144.0,109.0,252.0,252.0,252.0,144.0,218.0,253.0,253.0,255.0,35.0,175.0,252.0,252.0,253.0,35.0,73.0,252.0,252.0,253.0,35.0,31.0,211.0,252.0,253.0,35.0])]

I can't understand what is happening. I tried with simple data sets also,
but got a similar result.

Please help.

Thanks,

Vishnu
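With maxCategories = 4, VectorIndexer only re-codes features that have at most 4 distinct values; continuous features pass through unchanged, which is likely why the digits example above looks identical before and after. A minimal sketch (hypothetical toy data, assuming a Spark 1.6 Scala shell) where the re-coding is visible:

import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.mllib.linalg.Vectors

// Feature 0 is continuous; feature 1 takes only the values {-1.0, 0.0, 1.0}.
val toy = sqlContext.createDataFrame(Seq(
  (0, Vectors.dense(2.3, -1.0)),
  (1, Vectors.dense(4.1, 0.0)),
  (2, Vectors.dense(7.8, 1.0))
)).toDF("id", "features")

val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)

val model = indexer.fit(toy)

// Shows which features were treated as categorical and their value-to-index maps.
println(model.categoryMaps)

// Feature 1 is re-coded to category indices; feature 0 is left untouched.
model.transform(toy).show(false)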


Re: UDF in spark

2015-07-08 Thread VISHNU SUBRAMANIAN
Hi Vinod,

Yes, if you want to use a Scala or Python function you need to run that
block of code.

Only Hive UDFs are available permanently.

Thanks,
Vishnu

On Wed, Jul 8, 2015 at 5:17 PM, vinod kumar vinodsachin...@gmail.com
wrote:

 Thanks Vishnu,

 When I restart the service the UDF is not accessible to my query. I need to
 run the mentioned block again to use the UDF.
 Is there any way to maintain a UDF in sqlContext permanently?

 Thanks,
 Vinod

 On Wed, Jul 8, 2015 at 7:16 AM, VISHNU SUBRAMANIAN 
 johnfedrickena...@gmail.com wrote:

 Hi,

 sqlContext.udf.register("udfname", functionname _)

 example:

 def square(x: Int): Int = { x * x }

 register the udf as below

 sqlContext.udf.register("square", square _)

 Thanks,
 Vishnu

 On Wed, Jul 8, 2015 at 2:23 PM, vinod kumar vinodsachin...@gmail.com
 wrote:

 Hi Everyone,

 I am new to Spark. May I know how to define and use a User Defined Function
 in Spark SQL?

 I want to use the defined UDF in SQL queries.

 My Environment
 Windows 8
 spark 1.3.1

 Warm Regards,
 Vinod







Re: UDF in spark

2015-07-08 Thread VISHNU SUBRAMANIAN
Hi,

sqlContext.udf.register("udfname", functionname _)

example:

def square(x: Int): Int = { x * x }

register the udf as below

sqlContext.udf.register("square", square _)
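Once registered, the UDF can be used directly from SQL (a sketch; the table name and column are hypothetical):

// Assumes a table or registered temp table named "my_table" with an integer column "id".
sqlContext.sql("SELECT id, square(id) AS id_squared FROM my_table").show()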

Thanks,
Vishnu

On Wed, Jul 8, 2015 at 2:23 PM, vinod kumar vinodsachin...@gmail.com
wrote:

 Hi Everyone,

 I am new to Spark. May I know how to define and use a User Defined Function
 in Spark SQL?

 I want to use the defined UDF in SQL queries.

 My Environment
 Windows 8
 spark 1.3.1

 Warm Regards,
 Vinod





Re: used cores are less then total no. of core

2015-02-24 Thread VISHNU SUBRAMANIAN
Try adding --total-executor-cores 5, where 5 is the number of cores.
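The same cap can also be set from code via the spark.cores.max property (a sketch; the app name and core count are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// spark.cores.max limits the total cores a standalone-mode application may use,
// equivalent to passing --total-executor-cores to spark-submit.
val conf = new SparkConf()
  .setAppName("WordCount")
  .set("spark.cores.max", "32")
val sc = new SparkContext(conf)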

Thanks,
Vishnu

On Wed, Feb 25, 2015 at 11:52 AM, Somnath Pandeya 
somnath_pand...@infosys.com wrote:

  Hi All,



 I am running a simple word count example on Spark (standalone cluster).
 In the UI it shows that each worker has 32 cores available, but while
 running the jobs only 5 cores are being used.

 What should I do to increase the number of cores used, or is it selected
 based on the job?



 Thanks

 Somnaht





Re: Running Example Spark Program

2015-02-22 Thread VISHNU SUBRAMANIAN
Try restarting your Spark cluster:
./sbin/stop-all.sh
./sbin/start-all.sh

Thanks,
Vishnu

On Sun, Feb 22, 2015 at 7:30 PM, Surendran Duraisamy 
2013ht12...@wilp.bits-pilani.ac.in wrote:

  Hello All,

 I am new to Apache Spark. I am trying to run JavaKMeans.java from the Spark
 examples on my Ubuntu system.

 I downloaded spark-1.2.1-bin-hadoop2.4.tgz
 http://www.apache.org/dyn/closer.cgi/spark/spark-1.2.1/spark-1.2.1-bin-hadoop2.4.tgz
 and started sbin/start-master.sh

 After starting Spark, I accessed http://localhost:8080/ to look at the
 status of my Spark instance, and it shows the following.


- *URL:* spark://vm:7077
- *Workers:* 0
- *Cores:* 0 Total, 0 Used
- *Memory:* 0.0 B Total, 0.0 B Used
- *Applications:* 0 Running, 4 Completed
- *Drivers:* 0 Running, 0 Completed
- *Status:* ALIVE

 Number of cores is 0 and memory is 0.0 B. I think because of this I am
 getting the following error when I try to run JavaKMeans.java:

 Initial job has not accepted any resources; check your cluster UI to
 ensure that workers are registered and have sufficient memory

 Am I missing any configuration before running sbin/start-master.sh?
  Regards,
 Surendran



Re: Hive/Hbase for low latency

2015-02-11 Thread VISHNU SUBRAMANIAN
Hi Siddharth,

It depends on what exactly you are trying to solve, but the connectivity
between Cassandra and Spark is good.

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 7:47 PM, Siddharth Ubale 
siddharth.ub...@syncoms.com wrote:

  Hi ,



  I am new to Spark. We have recently moved from Apache Storm to Apache
  Spark to build our OLAP tool.

  Earlier we were using HBase and Phoenix.

  We need to re-think what to use in the case of Spark.

  Should we go ahead with HBase, Hive, or Cassandra for query processing
  with Spark SQL?



 Please share your views.



 Thanks,

 Siddharth Ubale,







Re: Question related to Spark SQL

2015-02-11 Thread VISHNU SUBRAMANIAN
I didn't mean that. When you try the above approach, only one client will
have access to the cached data.

But when you expose your data through a Thrift server the case is quite
different.

In the case of a Thrift server, all the requests go to the Thrift server and
Spark will be able to take advantage of caching.

That is, the Thrift server becomes your sole client to the Spark cluster.

Check this link:
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#running-the-thrift-jdbc-server

Your applications can connect to your Spark cluster through the JDBC driver.
It works similarly to the Hive Thrift server.
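A sketch of such a JDBC client in Scala (host, port, and table name are hypothetical; the Thrift JDBC server speaks the HiveServer2 protocol, so the standard Hive JDBC driver and URL format apply):

import java.sql.DriverManager

// Register the Hive JDBC driver and connect to the Spark Thrift server.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")

val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT count(*) FROM cached_table")  // hypothetical cached table
while (rs.next()) {
  println(rs.getLong(1))
}
conn.close()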

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 10:31 PM, Ashish Mukherjee 
ashish.mukher...@gmail.com wrote:

 Thanks for your reply, Vishnu.

 I assume you are suggesting I build Hive tables and cache them in memory
 and query on top of that for fast, real-time querying.

 Perhaps, I should write a generic piece of code like this and submit this
 as a Spark job with the SQL clause as an argument based on user selections
 on the Web interface -

 String sqlClause = args[0];
 ...
 JavaHiveContext sqlContext = new org.apache.spark.sql.hive.api.java.HiveContext(sc);
 // Queries are expressed in HiveQL.
 Row[] results = sqlContext.sql(sqlClause).collect();


 Is my understanding right?

 Regards,
 Ashish

 On Wed, Feb 11, 2015 at 4:42 PM, VISHNU SUBRAMANIAN 
 johnfedrickena...@gmail.com wrote:

 Hi Ashish,

  In order to answer your question, I assume that you are planning to
  process data and cache it in memory. If you are using the Thrift server
  that comes with Spark, then you can query on top of it, and multiple
  applications can use the cached data, as internally all the requests go to
  the Thrift server.

  Spark exposes the Hive query language and allows you to access its data
  through Spark, so you can consider using HiveQL for querying.

 Thanks,
 Vishnu

 On Wed, Feb 11, 2015 at 4:12 PM, Ashish Mukherjee 
 ashish.mukher...@gmail.com wrote:

 Hi,

 I am planning to use Spark for a Web-based adhoc reporting tool on
 massive date-sets on S3. Real-time queries with filters, aggregations and
 joins could be constructed from UI selections.

 Online documentation seems to suggest that SharkQL is deprecated and
 users should move away from it.  I understand Hive is generally not used
 for real-time querying and for Spark SQL to work with other data stores,
  tables need to be registered explicitly in code. Also, this would not
  be suitable for a dynamic query construction scenario.

 For a real-time , dynamic querying scenario like mine what is the proper
 tool to be used with Spark SQL?

 Regards,
 Ashish






Re: Re: How can I read this avro file using spark scala?

2015-02-11 Thread VISHNU SUBRAMANIAN
Check this link:
https://github.com/databricks/spark-avro

Home page for Spark-avro project.
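Typical usage once the spark-avro package is on the classpath (a sketch, assuming Spark 1.3+ and a hypothetical file path):

// Read an Avro file into a DataFrame via the spark-avro data source.
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("/path/to/episodes.avro")

df.printSchema()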

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 10:19 PM, Todd bit1...@163.com wrote:

 Databricks provides sample code on its website... but I can't find it right
 now.






 At 2015-02-12 00:43:07, captainfranz captainfr...@gmail.com wrote:
 I am confused as to whether avro support was merged into Spark 1.2 or it is
 still an independent library.
 I see some people writing sqlContext.avroFile similarly to jsonFile but this
 does not work for me, nor do I see this in the Scala docs.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-scala-tp19400p21601.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 




Re: getting the cluster elements from kmeans run

2015-02-11 Thread VISHNU SUBRAMANIAN
You can use model.predict(point); that will help you identify the cluster
for each point and map it back to the point.

rdd.map(x => (x, model.predict(x)))
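To collect the members of each cluster (a sketch in Scala, assuming points is the RDD of mllib Vectors the model was trained on; the Java API is analogous):

// Pair every point with its predicted cluster id, then group by cluster.
val byCluster = points.map(p => (model.predict(p), p)).groupByKey()

byCluster.collect().foreach { case (clusterId, members) =>
  println(s"cluster $clusterId has ${members.size} points")
}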

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 11:06 PM, Harini Srinivasan har...@us.ibm.com
wrote:

 Hi,

 Is there a way to get the elements of each cluster after running kmeans
 clustering? I am using the Java version.


 
 thanks




Re: NaiveBayes classifier causes ShuffleDependency class cast exception

2015-02-06 Thread VISHNU SUBRAMANIAN
Can you try creating just a single SparkContext and then running your code?
If you want to use it for streaming, pass the same SparkContext object
instead of the conf.
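A sketch of that setup in Scala (the app name, master, and batch interval are placeholders); the Java API has matching constructors:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Build one SparkContext and derive the StreamingContext from it,
// instead of creating two contexts from two separate confs.
val conf = new SparkConf().setAppName("streamer").setMaster("local[2]")
val sc   = new SparkContext(conf)
val ssc  = new StreamingContext(sc, Seconds(10))

// sc can now be used for MLlib training (e.g. NaiveBayes.train) while ssc
// drives the streaming job; no second context is needed.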

Note: Instead of just replying to me, try to use reply-all so that the
post is visible to the community. That way you can expect immediate
responses.

On Fri, Feb 6, 2015 at 6:09 AM, aanilpala aanilp...@gmail.com wrote:

 I have the following code:


 SparkConf conf = new SparkConf().setAppName("streamer").setMaster("local[2]");
 conf.set("spark.driver.allowMultipleContexts", "true");
 JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(batch_interval));
 ssc.checkpoint("/tmp/spark/checkpoint");

 SparkConf conf2 = new SparkConf().setAppName("classifier").setMaster("local[1]");
 conf2.set("spark.driver.allowMultipleContexts", "true");
 JavaSparkContext sc = new JavaSparkContext(conf);

 JavaReceiverInputDStream<String> stream = ssc.socketTextStream("localhost", /* port elided */ );

 // String to Tuple3 conversion
 JavaDStream<Tuple3<Long, String, String>> tuple_stream =
     stream.map(new Function<String, Tuple3<Long, String, String>>() { ... });

 JavaPairDStream<Integer, DictionaryEntry> raw_dictionary_stream =
     tuple_stream.filter(new Function<Tuple3<Long, String, String>, Boolean>() {

         @Override
         public Boolean call(Tuple3<Long, String, String> tuple) throws Exception {
             if ((tuple._1() / Time.scaling_factor % training_interval) < training_dur)
                 NaiveBayes.train(sc.parallelize(training_set).rdd());

             return true;
         }
     });

 I am working on a text mining project and I want to use the
 NaiveBayesClassifier of MLlib to classify some stream items. So I have two
 Spark contexts, one of which is a streaming context. The call to
 NaiveBayes.train causes the following exception.

 Any ideas?


  Exception in thread main org.apache.spark.SparkException: Job aborted
 due
 to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
 Lost task 0.0 in stage 0.0 (TID 0, localhost):
 java.lang.ClassCastException:
 org.apache.spark.SparkContext$$anonfun$runJob$4 cannot be cast to
 org.apache.spark.ShuffleDependency
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:60)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 Driver stacktrace:
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
 at scala.Option.foreach(Option.scala:236)
 at

 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
 at

 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at

 org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)
 at

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at

 

Re: Shuffle Dependency Casting error

2015-02-05 Thread VISHNU SUBRAMANIAN
Hi,

Could you share the code snippet?

Thanks,
Vishnu

On Thu, Feb 5, 2015 at 11:22 PM, aanilpala aanilp...@gmail.com wrote:

 Hi, I am working on a text mining project and I want to use the
 NaiveBayesClassifier of MLlib to classify some stream items. So I have two
 Spark contexts, one of which is a streaming context. Everything works fine
 if I comment out the train and predict methods, although then it obviously
 doesn't do what I want. The exception (and its trace) I am getting is
 below.

 Any ideas?

 Exception in thread main org.apache.spark.SparkException: Job aborted due
 to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
 Lost task 0.0 in stage 0.0 (TID 0, localhost):
 java.lang.ClassCastException:
 org.apache.spark.SparkContext$$anonfun$runJob$4 cannot be cast to
 org.apache.spark.ShuffleDependency
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:60)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 Driver stacktrace:
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
 at scala.Option.foreach(Option.scala:236)
 at

 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
 at

 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at

 org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)
 at

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at

 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Shuffle-Dependency-Casting-error-tp21518.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Java Kafka Word Count Issue

2015-02-02 Thread VISHNU SUBRAMANIAN
You can use updateStateByKey() to perform the above operation.
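A sketch of a stateful word count with updateStateByKey (in Scala; the Java API offers the same operation). It assumes words is a DStream[String] built from the Kafka stream and that a checkpoint directory has already been set with ssc.checkpoint(...):

// Running total per word: add the counts from the current batch to the
// previously stored state.
val updateCounts = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))

val runningCounts = words.map(w => (w, 1)).updateStateByKey(updateCounts)
runningCounts.print()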

On Mon, Feb 2, 2015 at 4:29 PM, Jadhav Shweta jadhav.shw...@tcs.com wrote:


 Hi Sean,

 Kafka Producer is working fine.
 This is related to Spark.

 How can I configure Spark so that it remembers the counts from the
 beginning?

 If my log.text file has

 spark
 apache
 kafka
 spark

 My Spark program gives correct output as

 spark 2
 apache 1
 kafka 1

 but when I append spark to my log.text file

 Spark program gives output as

 spark 1

 which should be spark 3.

 So how do I handle this in the Spark code?

 Thanks and regards
 Shweta Jadhav



 -Sean Owen so...@cloudera.com wrote: -
 To: Jadhav Shweta jadhav.shw...@tcs.com
 From: Sean Owen so...@cloudera.com
 Date: 02/02/2015 04:13PM
 Subject: Re: Java Kafka Word Count Issue

 This is a question about the Kafka producer right? Not Spark
 On Feb 2, 2015 10:34 AM, Jadhav Shweta jadhav.shw...@tcs.com wrote:


 Hi All,

 I am trying to run Kafka Word Count Program.
 please find below, the link for the same

 https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java

 I have set spark master to setMaster(local[*])

 and I have started Kafka Producer which reads the file.

 If my file already has a few words,
 then after running the Spark Java program I get the proper output.

 But when I append new words to the same file, it starts the word count
 again from 1.

 If I need to do the word count for both the already present and the newly
 appended words, exactly what changes do I need to make in the code for that?

 P.S. I am using Spark spark-1.2.0-bin-hadoop2.3

 Thanks and regards
 Shweta Jadhav





Re: Failed to save RDD as text file to local file system

2015-01-08 Thread VISHNU SUBRAMANIAN
Looks like it is trying to save the file to HDFS.

Check if you have set any Hadoop path in your system.

On Fri, Jan 9, 2015 at 12:14 PM, Raghavendra Pandey 
raghavendra.pan...@gmail.com wrote:

 Can you check permissions etc., as I am able to run
 r.saveAsTextFile("file:///home/cloudera/tmp/out1") successfully on my
 machine..

 On Fri, Jan 9, 2015 at 10:25 AM, NingjunWang ningjun.w...@lexisnexis.com
 wrote:

 I tried to save an RDD as a text file to the local file system (Linux) but
 it does not work.

 Launch spark-shell and run the following

 val r = sc.parallelize(Array("a", "b", "c"))
 r.saveAsTextFile("file:///home/cloudera/tmp/out1")


 IOException: Mkdirs failed to create

 file:/home/cloudera/tmp/out1/_temporary/0/_temporary/attempt_201501082027_0003_m_00_47
 (exists=false, cwd=file:/var/run/spark/work/app-20150108201046-0021/0)
 at

 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
 at

 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
 at

 org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
 at
 org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
 at

 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1056)
 at

 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)


 I also tried with 4 slashes but still get the same error:
 r.saveAsTextFile("file:////home/cloudera/tmp/out1")

 Please advise

 Ningjun




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Failed-to-save-RDD-as-text-file-to-local-file-system-tp21050.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org