Re: Prediction using Classification with text attributes in Apache Spark MLLib

2017-10-20 Thread lmk
Trying to improve the old solution.
Do we have a better text classifier now in Spark MLlib?

Regards,
lmk
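For reference, a minimal sketch of the spark.ml Pipeline approach available in
Spark 2.x for text classification. It is an illustration, not a drop-in
solution; the DataFrames (training, testData) and the "text"/"label" column
names are assumptions:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Tokenize the free text, hash the tokens into a fixed-size feature vector,
// weight by IDF, and fit a classifier on the result.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1 << 18)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(20)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, lr))
val model = pipeline.fit(training)            // training: DataFrame[text: String, label: Double]
val predictions = model.transform(testData)   // adds a "prediction" column

Any other classifier from org.apache.spark.ml.classification (NaiveBayes,
RandomForestClassifier, or LinearSVC in 2.2+) can be swapped in as the last
stage.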






Re: Bad Digest error while doing aws s3 put

2016-02-08 Thread lmk
Hi Dhimant,
As I indicated in my follow-up mail, my problem was due to the disk filling up
with log messages (these were dumped onto the slaves) and had nothing to do
with the content pushed to S3. So it looks like this error message is very
generic and is thrown for various reasons. You will probably have to do some
more research to find the cause of your problem.
Please keep me posted once you fix this issue. Sorry I could not be of more
help to you.

Regards






Re: SchemaRDD saveToCassandra

2014-09-16 Thread lmk
Hi Michael,
Please correct me if I am wrong, but the error seems to originate from Spark
itself. Please have a look at the stack trace of the error, which is as
follows:

[error] (run-main-0) java.lang.NoSuchMethodException: Cannot resolve any
suitable constructor for class org.apache.spark.sql.catalyst.expressions.Row
java.lang.NoSuchMethodException: Cannot resolve any suitable constructor for
class org.apache.spark.sql.catalyst.expressions.Row
    at com.datastax.spark.connector.rdd.reader.AnyObjectFactory$$anonfun$resolveConstructor$2.apply(AnyObjectFactory.scala:134)
    at com.datastax.spark.connector.rdd.reader.AnyObjectFactory$$anonfun$resolveConstructor$2.apply(AnyObjectFactory.scala:134)
    at scala.util.Try.getOrElse(Try.scala:77)
    at com.datastax.spark.connector.rdd.reader.AnyObjectFactory$.resolveConstructor(AnyObjectFactory.scala:133)
    at com.datastax.spark.connector.mapper.ReflectionColumnMapper.columnMap(ReflectionColumnMapper.scala:46)
    at com.datastax.spark.connector.writer.DefaultRowWriter.<init>(DefaultRowWriter.scala:20)
    at com.datastax.spark.connector.writer.DefaultRowWriter$$anon$2.rowWriter(DefaultRowWriter.scala:109)
    at com.datastax.spark.connector.writer.DefaultRowWriter$$anon$2.rowWriter(DefaultRowWriter.scala:107)
    at com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:171)
    at com.datastax.spark.connector.RDDFunctions.tableWriter(RDDFunctions.scala:76)
    at com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:56)
    at ScalaSparkCassandra$.main(ScalaSparkCassandra.scala:66)
    at ScalaSparkCassandra.main(ScalaSparkCassandra.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)

Please clarify.






SchemaRDD saveToCassandra

2014-09-11 Thread lmk
Hi,
My requirement is to extract certain fields from JSON files, run queries on
them and save the result to Cassandra.
I was able to parse the JSON, filter the result and save the (regular) RDD to
Cassandra.

Now, when I try to read the JSON file through sqlContext, execute some
queries on it and then save the SchemaRDD to Cassandra using the
saveToCassandra function, I am getting the following error:

java.lang.NoSuchMethodException: Cannot resolve any suitable constructor for
class org.apache.spark.sql.catalyst.expressions.Row

Please let me know whether a Spark SchemaRDD can be saved directly to
Cassandra just like a regular RDD.

If that is not possible, is there any way to convert the SchemaRDD to a
regular RDD?

Please advise.

Regards,
lmk
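A hedged sketch of the conversion being asked about (the column positions, the
case class and the keyspace/table names are made up for illustration): map
each Row of the SchemaRDD into a plain case class or tuple and call
saveToCassandra on that RDD instead, since the connector knows how to map
case-class fields to columns. The Seq of column names follows the call
pattern used later in this thread:

import com.datastax.spark.connector._   // brings saveToCassandra into scope

case class Result(id: String, value: String)

// schemaRdd is the SchemaRDD produced by sqlContext; pick fields by position.
val resultRdd = schemaRdd.map { row =>
  Result(row.getString(0), row.getString(1))
}
resultRdd.saveToCassandra("test_keyspace", "result_table", Seq("id", "value"))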






NotSerializableException while doing rdd.saveToCassandra

2014-08-27 Thread lmk
Hi All,
I am using spark-1.0.0 to parse a JSON file and save the values to Cassandra
using a case class.
My code looks as follows:
case class LogLine(x1: Option[String], x2: Option[String], x3: Option[List[String]],
  x4: Option[String], x5: Option[String], x6: Option[String], x7: Option[String],
  x8: Option[String], x9: Option[String])

val data = test.map(line => {
  parse(line)
}).map(json => {

  // Extract the values
  implicit lazy val formats = org.json4s.DefaultFormats

  val x1 = (json \ "x1").extractOpt[String]
  val x2 = (json \ "x2").extractOpt[String]
  val x4 = (json \ "x4").extractOpt[String]
  val x5 = (json \ "x5").extractOpt[String]
  val x6 = (json \ "x6").extractOpt[String]
  val x7 = (json \ "x7").extractOpt[String]
  val x8 = (json \ "x8").extractOpt[String]
  val x3 = (json \ "x3").extractOpt[List[String]]
  val x9 = (json \ "x9").extractOpt[String]

  LogLine(x1, x2, x3, x4, x5, x6, x7, x8, x9)
})

data.saveToCassandra("test", "test_data",
  Seq("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"))

whereas the cassandra table schema is as follows:
CREATE TABLE test_data (
x1 varchar,
x2 varchar,
x4 varchar,
x5 varchar,
x6 varchar,
x7 varchar,
x8 varchar,
x3 list<text>,
x9 varchar,
PRIMARY KEY (x1));

I am getting the following error on executing the saveToCassandra statement:

14/08/27 11:33:59 INFO SparkContext: Starting job: runJob at
package.scala:169
14/08/27 11:33:59 INFO DAGScheduler: Got job 5 (runJob at package.scala:169)
with 1 output partitions (allowLocal=false)
14/08/27 11:33:59 INFO DAGScheduler: Final stage: Stage 5(runJob at
package.scala:169)
14/08/27 11:33:59 INFO DAGScheduler: Parents of final stage: List()
14/08/27 11:33:59 INFO DAGScheduler: Missing parents: List()
14/08/27 11:33:59 INFO DAGScheduler: Submitting Stage 5 (MappedRDD[7] at map
at <console>:45), which has no missing parents
14/08/27 11:33:59 INFO DAGScheduler: Failed to run runJob at
package.scala:169
org.apache.spark.SparkException: Job aborted due to stage failure: Task not
serializable: java.io.NotSerializableException: org.apache.spark.SparkConf

data.saveToCassandra("test", "test_data",
  Seq("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"))

Here the data field is org.apache.spark.rdd.RDD[LogLine] = MappedRDD[7] at
map at <console>:45.

How can I make this serializable, or is this a different problem?
Please advise.

Regards,
lmk
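A hedged sketch of the usual cause of "NotSerializableException:
org.apache.spark.SparkConf": the map/filter closure refers to a field of an
enclosing object, so Spark tries to serialize that whole object, including a
SparkConf it happens to hold. Copying just the needed value into a local val
keeps the closure small and serializable. The class and setting names below
are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

class AppJob(sc: SparkContext) {
  val conf: SparkConf = sc.getConf
  val threshold = conf.get("spark.myapp.threshold", "10").toInt  // hypothetical setting

  def run(lines: RDD[String]): RDD[String] = {
    // Referring to `threshold` directly inside the closure would capture `this`,
    // and with it `conf`, triggering the NotSerializableException.
    val localThreshold = threshold            // a plain Int, safely serializable
    lines.filter(_.length > localThreshold)   // closure now captures only localThreshold
  }
}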






Re: Apache Spark- Cassandra - NotSerializable Exception while saving to cassandra

2014-08-27 Thread lmk
Hi Yana
I have done a take and confirmed the existence of data. I have also checked
that it is getting connected to Cassandra. That is why I suspect that this
particular RDD is not serializable.
Thanks,
Lmk
On Aug 28, 2014 5:13 AM, Yana [via Apache Spark User List]
<ml-node+s1001560n12960...@n3.nabble.com> wrote:

 I'm not so sure that your error is coming from the cassandra write.

 you have val data = test.map(..).map(..)

 so data will actually not get created until you try to save it. Can you
 try to do something like data.count() or data.take(k) after this line and
 see if you even get to the cassandra part? My suspicion is that you're
 trying to access something (SparkConf?) within the map closures...







Re: Bad Digest error while doing aws s3 put

2014-08-07 Thread lmk
This was a completely misleading error message.

The problem was due to a log message getting dumped to stdout. This
accumulated on the workers, and hence there was no space left on the device
after some time.
When I re-tested with spark-0.9.1, the saveAsTextFile API threw a "no space
left on device" error after writing the same 48 files. On checking the
master, everything was fine.
But on checking the slaves, the stdout logs accounted for 99% of the root
filesystem.
On removing that particular log statement, it now works fine in both
versions.
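For anyone hitting the same disk-full symptom, a hedged sketch (the rolling
log settings below were added around Spark 1.1, so verify the keys against
the version you run) of capping the executor stdout/stderr logs so they
cannot fill the slaves' root filesystem; the app name is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("s3-export")                                    // hypothetical app name
  .set("spark.executor.logs.rolling.strategy", "size")        // roll the logs by file size
  .set("spark.executor.logs.rolling.maxSize", "134217728")    // 128 MB per log file
  .set("spark.executor.logs.rolling.maxRetainedFiles", "5")   // keep only the newest 5 files
val sc = new SparkContext(conf)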







Re: Bad Digest error while doing aws s3 put

2014-08-05 Thread lmk
Is it possible that the Content-MD5 changes during a multipart upload to S3?
But even then, it succeeds if I increase the cluster size.

For example:
it throws a Bad Digest error after writing 48/100 files when the cluster has
3 m3.2xlarge slaves;
it throws a Bad Digest error after writing 64/100 files when the cluster has
4 m3.2xlarge slaves;
it throws a Bad Digest error after writing 86/100 files when the cluster has
5 m3.2xlarge slaves;
it succeeds in writing all 100 files when the cluster has 6 m3.2xlarge
slaves.

Please clarify.

Regards,
lmk







Re: Bad Digest error while doing aws s3 put

2014-08-04 Thread lmk
Thanks Patrick. 

But why am I getting a Bad Digest error when I am saving a large amount of
data to S3?

Loss was due to org.apache.hadoop.fs.s3.S3Exception
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException:
S3 PUT failed for
'/spark_test%2Fsmaato_one_day_phase_2%2Fsmaato_2014_05_17%2F_temporary%2F_attempt_201408041624__m_65_165%2Fpart-00065'
XML Error Message:
<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>BadDigest</Code>
  <Message>The Content-MD5 you specified did not match what we received.</Message>
  <ExpectedDigest>lb2tDEVSSnRNM4pw6504Bg==</ExpectedDigest>
  <CalculatedDigest>EL9UDBzFvTwJycA7Ii2KGA==</CalculatedDigest>
  <RequestId>437F15C89D355081</RequestId>
  <HostId>kJQI+c9edzBmT2Z9sbfAELYT/8R5ezLWeUgeIU37iPsq5KQm/qAXItunZY35wnYx</HostId>
</Error>
  at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:82)

As indicated earlier, I use the following command as an alternative to
saveAsTextFile:

x.map(x => (NullWritable.get(), new Text(x.toString)))
  .coalesce(100)
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]]("s3n://dest-dir/")

In the above case, it succeeds until it has written about 48 of the 100 part
files (though that 48 is also inconsistent) and then starts throwing the
above error. The same run works fine if I increase the capacity of the
cluster (say from 3 m3.2xlarge slaves to 6) or reduce the data size.

Is there a possibility that the data is getting corrupted when the load
increases?

Please advise. I have been stuck on this problem for the past couple of weeks.

Thanks,
lmk






Re: Bad Digest error while doing aws s3 put

2014-07-28 Thread lmk
Hi
I was using saveAsTextFile earlier. It was working fine. When we migrated to
spark-1.0, I started getting the following error:
java.lang.ClassNotFoundException:
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
java.net.URLClassLoader$1.run(URLClassLoader.java:366)
java.net.URLClassLoader$1.run(URLClassLoader.java:355)

Hence I changed my code as follows:

x.map(x => (NullWritable.get(), new Text(x.toString)))
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)

After this change, I am facing the problem when I write very large data to
S3. It also occurs only for some partitions: say, while writing 240
partitions, it might succeed for 156 files and then start throwing the Bad
Digest error, after which it hangs.

Please advise.

Regards,
lmk





save to HDFS

2014-07-24 Thread lmk
Hi,
I have a scala application which I have launched into a spark cluster. I
have the following statement trying to save to a folder in the master:
saveAsHadoopFile[TextOutputFormat[NullWritable,
Text]]("hdfs://masteripaddress:9000/root/test-app/test1/")

The application executes successfully and the log says that the save is
complete. But I am not able to find the saved file anywhere. Is there a way I
can access this file?

Please advise.

Regards,
lmk





Re: save to HDFS

2014-07-24 Thread lmk
Hi Akhil,
I am sure that the RDD that I saved is not empty; I have tested it using
take.
But is there no way I can see the saved output on the filesystem, the way I
would in a normal (local) context? Can't I view this folder, since I am
already logged into the cluster?
And should I run
hadoop fs -ls hdfs://masteripaddress:9000/root/test-app/test1/
after I log in to the cluster?

Thanks,
lmk





Re: save to HDFS

2014-07-24 Thread lmk
Thanks Akhil.
I was able to view the files. I was actually trying to list them with a
regular ls, and since that did not show anything I was concerned.
Thanks for showing me the right direction.

Regards,
lmk





Bad Digest error while doing aws s3 put

2014-07-17 Thread lmk
Hi,
I am getting the following error while trying to save a large dataset to S3
using the saveAsHadoopFile command with Apache Spark 1.0.
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException:
S3 PUT failed for
'/spark_test%2Fsmaato_one_day_phase_2%2Fsmaato_2014_05_17%2F_temporary%2F_attempt_201407170658__m_36_276%2Fpart-00036'
XML Error Message:
<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>BadDigest</Code>
  <Message>The Content-MD5 you specified did not match what we received.</Message>
  <ExpectedDigest>N808DtNfYiTFzI+i2HxLEw==</ExpectedDigest>
  <CalculatedDigest>66nS+2C1QqQmmcTeFpXOjw==</CalculatedDigest>
  <RequestId>4FB3A3D60B187CE7</RequestId>
  <HostId>H2NznP+RvwspekVHBMvgWGYAupKuO5YceSgmiLym6rOajOh5v5GnyM0VkO+dadyG</HostId>
</Error>

I have used the same command to write similar content with less data to S3
without any problem. When I googled this error message, the common suggestion
is that it might be due to an MD5 checksum mismatch. But would that happen
because of load?

Regards,
lmk 





Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-24 Thread lmk
Hi,
I am trying to predict an attribute with a binary value (Yes/No) using SVM.
All the attributes in my training set are text attributes.
I understand that I have to convert my outcome to a double (0.0/1.0), but I
do not understand how to deal with my explanatory variables, which are also
text.
Please let me know how I can do this.

Thanks.
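A minimal sketch of one way to do this with MLlib's hashing trick
(mllib.feature.HashingTF, available from roughly Spark 1.1): hash the text
attributes into a numeric feature vector and map the Yes/No outcome to
1.0/0.0. The input layout (raw: RDD[String], CSV with the outcome in the
first column) is an assumption for illustration:

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

val hashingTF = new HashingTF(1 << 16)              // size of the feature vector

val training = raw.map { line =>
  val fields = line.split(",")                      // hypothetical CSV layout
  val label = if (fields(0) == "Yes") 1.0 else 0.0  // binary outcome in the first column
  val tokens = fields.drop(1).toSeq                 // remaining text attributes as tokens
  LabeledPoint(label, hashingTF.transform(tokens))
}.cache()

val model = SVMWithSGD.train(training, 100)         // 100 iterations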







RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-24 Thread lmk
Hi Alexander,
Thanks for your prompt response. Earlier I was doing this prediction using
Weka, but now we are moving to a huge dataset and hence to Apache Spark
MLlib. Is there any other way to convert to libSVM format? Or is there any
other, simpler algorithm that I can use in MLlib?

Thanks,
lmk
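On the libSVM question, a hedged note: once the data is an RDD[LabeledPoint]
(for example built as in the sketch in the previous message), MLlib's MLUtils
can write and read the libSVM text format itself, so no hand conversion
should be needed. The path below is hypothetical:

import org.apache.spark.mllib.util.MLUtils

MLUtils.saveAsLibSVMFile(training, "hdfs:///tmp/training_libsvm")
val reloaded = MLUtils.loadLibSVMFile(sc, "hdfs:///tmp/training_libsvm")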





Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread lmk
Hi Cheng,
Sorry to bother you again.

In this method, I see that the values for
  a <- positions.iterator
  b <- positions.iterator
always remain the same. I tried a <- positions.iterator.next (and the same
for b), and it throws an error: value filter is not a member of (Double,
Double).

Is there something I am missing here?

Thanks
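A hedged illustration of the pairwise pattern under discussion: materialise
the positions once (for example as a Vector) and let both generators of the
for-comprehension draw from that collection, so the second generator is not
reading from an already-consumed Iterator. Here positions stands for one IP's
Iterable[(Double, Double)] coming out of groupByKey:

val points: Vector[(Double, Double)] = positions.toVector

val pairs = for {
  (a, i) <- points.zipWithIndex
  b      <- points.drop(i + 1)   // yields each unordered pair exactly once
} yield (a, b)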





Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread lmk
Hi Cheng,
Thanks a lot. That solved my problem.

Thanks again for the quick response and solution.





Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread lmk
Hi,
I am a new Spark user. Please let me know how to handle the following scenario:

I have a data set with the following fields:
1. DeviceId
2. latitude
3. longitude
4. ip address
5. Datetime
6. Mobile application name

With the above data, I would like to perform the following steps:
1. Collect all lat and lon for each ipaddress 
(ip1,(lat1,lon1),(lat2,lon2))
(ip2,(lat3,lon3),(lat4,lon4))
2. For each IP, 
1.Find the distance between each lat and lon coordinate pair and all
the other pairs under the same IP 
2.Select those coordinates whose distances fall under a specific
threshold (say 100m) 
3.Find the coordinate pair with the maximum occurrences 

In this case, how can I iterate and compare each coordinate pair with all
the other pairs? 
Can this be done in a distributed manner, as this data set is going to have
a few million records? 
Can we do this in map/reduce commands?

Thanks.





Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread lmk
Hi Oleg/Andrew,
Thanks much for the prompt response.

We expect thousands of lat/lon pairs for each IP address. And that is my
concern with the Cartesian product approach. 
Currently for a small sample of this data (5000 rows) I am grouping by IP
address and then computing the distance between lat/lon coordinates using
array manipulation techniques. 
But I understand this approach is not right when the data volume goes up.
My code is as follows:

val dataset: RDD[String] = sc.textFile("x.csv")
val data = dataset.map(l => l.split(","))
val grpData = data.map(r =>
  (r(3), (r(1).toDouble, r(2).toDouble))).groupByKey()

Now I have the data grouped by IP address as Array[(String,
Iterable[(Double, Double)])], e.g.
  Array((ip1, ArrayBuffer((lat1,lon1), (lat2,lon2), (lat3,lon3))))

Now I have to find the distance between (lat1,lon1) and (lat2,lon2) and then
between (lat1,lon1) and (lat3,lon3) and so on for all combinations.

This is where I get stuck. Please guide me on this.

Thanks Again.
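A minimal sketch along the lines being asked about (not a tuned solution):
for each IP, enumerate the coordinate pairs with combinations(2), keep the
pairs whose distance falls under the threshold, and count occurrences to pick
the most frequent coordinate. The haversine helper and the 100 m threshold
are illustrative assumptions:

import scala.math._

def haversineMeters(p1: (Double, Double), p2: (Double, Double)): Double = {
  val earthRadius = 6371000.0
  val dLat = toRadians(p2._1 - p1._1)
  val dLon = toRadians(p2._2 - p1._2)
  val a = pow(sin(dLat / 2), 2) +
    cos(toRadians(p1._1)) * cos(toRadians(p2._1)) * pow(sin(dLon / 2), 2)
  2 * earthRadius * atan2(sqrt(a), sqrt(1 - a))
}

val topPerIp = grpData.mapValues { coords =>
  val points = coords.toVector
  // coordinates that participate in at least one pair closer than 100 m
  val closePoints = points.combinations(2).toSeq.collect {
    case Vector(a, b) if haversineMeters(a, b) <= 100.0 => Seq(a, b)
  }.flatten
  if (closePoints.isEmpty) None
  else Some(closePoints.groupBy(identity).maxBy(_._2.size)._1)  // most frequent coordinate
}

Since combinations(2) is quadratic per IP, this stays distributed across IPs
but can still be heavy for an IP with many thousands of points, which is the
same trade-off raised above about the Cartesian product approach.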
 


