Re: Pyspark DataFrame TypeError

2015-09-08 Thread Prabeesh K.
Thanks for the reply. After a rebuild it now looks good.

On 8 September 2015 at 22:38, Davies Liu  wrote:

> I tried with Python 2.7/3.4 and Spark 1.4.1/1.5-RC3, they all work as
> expected:
>
> ```
> >>> from pyspark.mllib.linalg import Vectors
> >>> df = sqlContext.createDataFrame([(1.0, Vectors.dense([1.0])), (0.0,
> Vectors.sparse(1, [], []))], ["label", "featuers"])
> >>> df.show()
> +-----+---------+
> |label| featuers|
> +-----+---------+
> |  1.0|    [1.0]|
> |  0.0|(1,[],[])|
> +-----+---------+
>
> >>> df.columns
> ['label', 'featuers']
> ```
>
> On Tue, Sep 8, 2015 at 1:45 AM, Prabeesh K.  wrote:
> > I am trying to run the RandomForestClassifier example from the PySpark
> > 1.4.1 documentation,
> >
> https://spark.apache.org/docs/1.4.1/api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestClassifier
> .
> >
> > Below is a screenshot of the IPython notebook (the image is not included in this archive).
> >
> > But for df.columns, it shows the following error:
> >
> >
> > TypeError                                 Traceback (most recent call last)
> > <ipython-input> in <module>()
> > ----> 1 df.columns
> >
> > /home/datasci/src/spark/python/pyspark/sql/dataframe.pyc in columns(self)
> > 484 ['age', 'name']
> > 485 """
> > --> 486 return [f.name for f in self.schema.fields]
> > 487
> > 488 @ignore_unicode_prefix
> >
> > /home/datasci/src/spark/python/pyspark/sql/dataframe.pyc in schema(self)
> > 194 """
> > 195 if self._schema is None:
> > --> 196 self._schema =
> > _parse_datatype_json_string(self._jdf.schema().json())
> > 197 return self._schema
> > 198
> >
> > /home/datasci/src/spark/python/pyspark/sql/types.pyc in
> > _parse_datatype_json_string(json_string)
> > 519 >>> check_datatype(structtype_with_udt)
> > 520 """
> > --> 521 return _parse_datatype_json_value(json.loads(json_string))
> > 522
> > 523
> >
> > /home/datasci/src/spark/python/pyspark/sql/types.pyc in
> > _parse_datatype_json_value(json_value)
> > 539 tpe = json_value["type"]
> > 540 if tpe in _all_complex_types:
> > --> 541 return _all_complex_types[tpe].fromJson(json_value)
> > 542 elif tpe == 'udt':
> > 543 return UserDefinedType.fromJson(json_value)
> >
> > /home/datasci/src/spark/python/pyspark/sql/types.pyc in fromJson(cls,
> json)
> > 386 @classmethod
> > 387 def fromJson(cls, json):
> > --> 388 return StructType([StructField.fromJson(f) for f in
> > json["fields"]])
> > 389
> > 390
> >
> > /home/datasci/src/spark/python/pyspark/sql/types.pyc in fromJson(cls,
> json)
> > 347 def fromJson(cls, json):
> > 348 return StructField(json["name"],
> > --> 349             _parse_datatype_json_value(json["type"]),
> >     350             json["nullable"],
> >     351             json["metadata"])
> >
> > /home/datasci/src/spark/python/pyspark/sql/types.pyc in
> > _parse_datatype_json_value(json_value)
> > 541 return _all_complex_types[tpe].fromJson(json_value)
> > 542 elif tpe == 'udt':
> > --> 543 return UserDefinedType.fromJson(json_value)
> > 544 else:
> > 545 raise ValueError("not supported type: %s" % tpe)
> >
> > /home/datasci/src/spark/python/pyspark/sql/types.pyc in fromJson(cls,
> json)
> > 453 pyModule = pyUDT[:split]
> > 454 pyClass = pyUDT[split+1:]
> > --> 455 m = __import__(pyModule, globals(), locals(), [pyClass])
> > 456 UDT = getattr(m, pyClass)
> > 457 return UDT()
> >
> > TypeError: Item in ``from list'' not a string
> >


Re: jenkins failing on Kinesis shard limits

2015-07-24 Thread Prabeesh K.
For me:

https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/97/console

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38417/console

On 25 July 2015 at 09:57, Patrick Wendell  wrote:

> I've disabled the test and filed a JIRA:
>
> https://issues.apache.org/jira/browse/SPARK-9335
>
> On Fri, Jul 24, 2015 at 4:05 PM, Steve Loughran 
> wrote:
> >
> > Looks like Jenkins is hitting some AWS limits
> >
> >
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38396/testReport/org.apache.spark.streaming.kinesis/KinesisBackedBlockRDDSuite/_It_is_not_a_test_/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: not in gzip format

2015-04-07 Thread prabeesh k
But the name is just confusing.

On 7 April 2015 at 16:35, Sean Owen  wrote:

> Er, click the link? It is indeed a redirector HTML page. This is how all
> Apache releases are served.
> On Apr 7, 2015 8:32 AM, "prabeesh k"  wrote:
>
>> Please check the Apache mirror file:
>> http://www.apache.org/dyn/closer.cgi/spark/spark-1.3.0/spark-1.3.0.tgz
>> It is not in gzip format.
>>
>


not in gzip format

2015-04-07 Thread prabeesh k
Please check the Apache mirror file:
http://www.apache.org/dyn/closer.cgi/spark/spark-1.3.0/spark-1.3.0.tgz
It is not in gzip format.


Re: Welcoming three new committers

2015-02-03 Thread prabeesh k
Congratulations!

On 4 February 2015 at 02:34, Matei Zaharia  wrote:

> Hi all,
>
> The PMC recently voted to add three new committers: Cheng Lian, Joseph
> Bradley and Sean Owen. All three have been major contributors to Spark in
> the past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and many
> pieces throughout Spark Core. Join me in welcoming them as committers!
>
> Matei
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Unable to execute saveAsTextFile on multi-node Mesos

2014-05-31 Thread prabeesh k
Hi,

Scenario: read data from HDFS, apply a Hive query on it, and write the result back to HDFS (a rough sketch of this flow is shown below).
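For context, a minimal Scala sketch of that flow using the plain Spark RDD API (the real job goes through Shark/Hive; the master URL, paths, and the stand-in transformation are assumptions, not the actual code):

```
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// Hypothetical sketch: read from HDFS, transform (standing in for the Hive query),
// and write the result back to HDFS with saveAsTextFile.
object SaveBackToHdfs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("mesos://master:5050", "SaveBackToHdfs")

    val input  = sc.textFile("hdfs://namenode:9000/data/input")   // placeholder input path
    val result = input.filter(_.nonEmpty)                         // stands in for the Hive query
    result.saveAsTextFile("hdfs://namenode:9000/data/output")     // the step that fails here

    sc.stop()
  }
}
```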

Schema creation, querying, and saveAsTextFile are working fine in the following modes:

   - local mode
   - single-node Mesos cluster
   - multi-node Spark cluster

Schema creation and querying also work fine on a multi-node Mesos cluster, but while
trying to write back to HDFS using saveAsTextFile, the following error occurs:

14/05/30 10:16:35 INFO DAGScheduler: The failed fetch was from Stage 4 (mapPartitionsWithIndex at Operator.scala:333); marking it for resubmission
14/05/30 10:16:35 INFO DAGScheduler: Executor lost: 201405291518-3644595722-5050-17933-1 (epoch 148)

Let me know your thoughts regarding this.

Regards,
prabeesh


java.lang.OutOfMemoryError while running Shark on Mesos

2014-05-22 Thread prabeesh k
Hi,

I am trying to apply an inner join in Shark using 64MB and 27MB files. I am
able to run the following queries on Mesos:


   - "SELECT * FROM geoLocation1 "



   - """ SELECT * FROM geoLocation1  WHERE  country =  '"US"' """


But while trying the inner join:

  "SELECT * FROM geoLocation1 g1 INNER JOIN geoBlocks1 g2 ON (g1.locId = g2.locId)"



I am getting the following error from the join query:


Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 1.0:7 failed 4 times (most recent failure: Exception failure: java.lang.OutOfMemoryError: Java heap space)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Please help me to resolve this.

Thanks in advance.

regards,
prabeesh


Better option for querying in Spark

2014-05-05 Thread prabeesh k
Hi,

I have seen three different ways to query data in Spark:

   1. Default SQL support
      (https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/sql/examples/HiveFromSpark.scala)
   2. Shark
   3. BlinkDB

I would like to know which one is more efficient (a rough sketch of option 1 is included below).
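For what it's worth, here is a minimal Scala sketch of option 1, assuming Spark SQL's HiveContext and that its sql(...) method runs HiveQL (older versions of the linked example use hql(...) instead); the table, data file, and query are placeholders loosely based on the HiveFromSpark example:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical sketch of option 1: Spark SQL's built-in Hive support.
object HiveQuerySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveQuerySketch"))
    val hiveContext = new HiveContext(sc)

    hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

    // Query results come back as collections that support normal Spark operations.
    hiveContext.sql("SELECT key, value FROM src WHERE key < 10").collect().foreach(println)

    sc.stop()
  }
}
```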

Regards.
prabeesh


Link not working

2014-04-21 Thread prabeesh k
For Spark 0.8.0, the download links are not working.

Please update them.

Regards,
prabeesh


Re: [VOTE] Release Apache Spark 0.9.1 (RC3)

2014-03-30 Thread prabeesh k
+1
Tested on Ubuntu 12.04 64-bit.


On Mon, Mar 31, 2014 at 3:56 AM, Matei Zaharia wrote:

> +1 tested on Mac OS X.
>
> Matei
>
> On Mar 27, 2014, at 1:32 AM, Tathagata Das 
> wrote:
>
> > Please vote on releasing the following candidate as Apache Spark version
> 0.9.1
> >
> > A draft of the release notes along with the CHANGES.txt file is
> > attached to this e-mail.
> >
> > The tag to be voted on is v0.9.1-rc3 (commit 4c43182b):
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4c43182b6d1b0b7717423f386c0214fe93073208
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~tdas/spark-0.9.1-rc3/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/tdas.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1009/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~tdas/spark-0.9.1-rc3-docs/
> >
> > Please vote on releasing this package as Apache Spark 0.9.1!
> >
> > The vote is open until Sunday, March 30, at 10:00 UTC and passes if
> > a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 0.9.1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> > 
>
>


Re: [VOTE] Release Apache Spark 0.9.1 (RC3)

2014-03-29 Thread prabeesh k
One more update for the docs.

On the Spark Streaming home page, http://spark.incubator.apache.org/streaming/, under
"Deployment Options", it is mentioned that "Spark Streaming can read data from HDFS
(http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html),
Flume (http://flume.apache.org/), Kafka (http://kafka.apache.org/), Twitter
(https://dev.twitter.com/) and ZeroMQ (http://zeromq.org/)".

But from Spark Streaming 0.9.0 onwards it also supports MQTT.

Could you please update this after the voting has completed?
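For reference, a minimal sketch of that MQTT support, assuming the spark-streaming-mqtt module and its MQTTUtils.createStream helper (the broker URL and topic are placeholders):

```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils

// Hypothetical sketch: subscribe to an MQTT topic and print the received messages.
object MqttStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MqttStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(2))

    val lines = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "sensor/readings")
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```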


On Sat, Mar 29, 2014 at 9:28 PM, Tathagata Das
wrote:

> Small fixes to the docs can be done after the voting has completed. This
> should not determine the vote on the release candidate binaries. Please
> vote as "+1" if the published artifacts and binaries are good to go.
>
> TD
> On Mar 29, 2014 5:23 AM, "prabeesh k"  wrote:
>
> > In https://github.com/apache/spark/blob/master/docs/quick-start.md, line 127, there is
> > one spelling mistake; please correct it ("proogram" to "program").
> >
> >
> >
> > On Fri, Mar 28, 2014 at 9:58 PM, Will Benton  wrote:
> >
> > > RC3 works with the applications I'm working on now and MLLib
> performance
> > > is indeed perceptibly improved over 0.9.0 (although I haven't done a
> real
> > > evaluation).  Also, from the downstream perspective, I've been tracking
> > the
> > > 0.9.1 RCs in Fedora and have no issues to report there either:
> > >
> > >http://koji.fedoraproject.org/koji/buildinfo?buildID=507284
> > >
> > > [x] +1 Release this package as Apache Spark 0.9.1
> > > [ ] -1 Do not release this package because ...
> > >
> >
>


Re: [VOTE] Release Apache Spark 0.9.1 (RC3)

2014-03-29 Thread prabeesh k
In https://github.com/apache/spark/blob/master/docs/quick-start.md, line 127, there is
one spelling mistake; please correct it ("proogram" to "program").



On Fri, Mar 28, 2014 at 9:58 PM, Will Benton  wrote:

> RC3 works with the applications I'm working on now and MLLib performance
> is indeed perceptibly improved over 0.9.0 (although I haven't done a real
> evaluation).  Also, from the downstream perspective, I've been tracking the
> 0.9.1 RCs in Fedora and have no issues to report there either:
>
>http://koji.fedoraproject.org/koji/buildinfo?buildID=507284
>
> [x] +1 Release this package as Apache Spark 0.9.1
> [ ] -1 Do not release this package because ...
>


Re: [DISCUSS] Scala Style for import

2014-03-13 Thread prabeesh k
From the Spark Code Style Guide
(https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports)
we can find the following import style:

  "Always import packages using absolute paths (e.g. scala.util.Random) instead of relative
  ones (e.g. util.Random). In addition, sort imports in the following order:
    - java.* and javax.*
    - scala.*
    - Third-party libraries (org.*, com.*, etc.)
    - Project classes (org.apache.spark.*)"
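As an illustration, a hypothetical file header (not taken from the Spark code base) with imports grouped and ordered per that rule:

```
// Hypothetical example of the recommended import grouping and ordering.
// javax.annotation.Nullable assumes the jsr305 jar on the classpath.
import java.util.concurrent.TimeUnit
import javax.annotation.Nullable

import scala.collection.mutable
import scala.util.Random

import com.google.common.io.Files
import org.eclipse.paho.client.mqttv3.MqttClient

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
```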

Some of the Spark code does not follow the above order for imports.

Additionally, it would be good to include some further import style guidance from
http://twitter.github.io/effectivescala/#Formatting-Imports:

   - Use braces when importing several names from a package:
     import org.apache.flume.source.avro.{AvroSourceProtocol, AvroFlumeEvent, Status}

   - Use wildcards when more than six names are imported:
     import org.apache.flume.source.avro._
     Don't apply this blindly: some packages export too many names.

   - When using collections, qualify names by importing scala.collection.immutable and/or
     scala.collection.mutable. Mutable and immutable collections have dual names, and
     qualifying the names makes it obvious to the reader which variant is being used
     (e.g. "immutable.Map").

   - Do not use relative imports from other packages. Avoid
       import org.apache.flume.source.avro.AvroSourceProtocol
       import AvroFlumeEvent
     in favor of the unambiguous
       import org.apache.flume.source.avro.AvroFlumeEvent

   - Put imports at the top of the file, so the reader can refer to all imports in one place.

Post your thoughts.

Regards,
prabeesh

On Thu, Mar 13, 2014 at 1:49 PM, prabeesh k  wrote:

> An example of an unblocked import:
>
>   import org.eclipse.paho.client.mqttv3.MqttClient
>   import org.eclipse.paho.client.mqttv3.MqttClientPersistence
>   import org.eclipse.paho.client.mqttv3.MqttException
>   import org.eclipse.paho.client.mqttv3.MqttMessage
>   import org.eclipse.paho.client.mqttv3.MqttTopic
>
> This can also be represented using the blocked method as follows:
>
>   import org.eclipse.paho.client.mqttv3.{MqttClient, MqttException,
> MqttMessage, MqttTopic, MqttClientPersistence}
>
>
>
>
> On Thu, Mar 13, 2014 at 1:39 PM, Prashant Sharma wrote:
>
>>  What exactly do you mean by blocked and unblocked import ?
>>
>> Prashant Sharma
>>
>>
>> On Thu, Mar 13, 2014 at 1:32 PM, prabeesh k  wrote:
>>
>> > Hi All,
>> >
>> > We can import packages in Scala as blocked imports (several names grouped in braces)
>> > or unblocked imports (one name per line).
>> >
>> > I think blocked imports are better, since they help reduce LOC.
>> >
>> > But the Spark code uses a mix of both, so it would be better to settle on one.
>> >
>> > Please post your thoughts on the Scala style for imports.
>> >
>> > Regards,
>> > prabeesh
>> >
>>
>
>


Re: [DISCUSS] Scala Style for import

2014-03-13 Thread prabeesh k
An example of an unblocked import:

  import org.eclipse.paho.client.mqttv3.MqttClient
  import org.eclipse.paho.client.mqttv3.MqttClientPersistence
  import org.eclipse.paho.client.mqttv3.MqttException
  import org.eclipse.paho.client.mqttv3.MqttMessage
  import org.eclipse.paho.client.mqttv3.MqttTopic

This can also be represented using the blocked method as follows:

  import org.eclipse.paho.client.mqttv3.{MqttClient, MqttException,
MqttMessage, MqttTopic, MqttClientPersistence}




On Thu, Mar 13, 2014 at 1:39 PM, Prashant Sharma wrote:

> What exactly do you mean by blocked and unblocked import ?
>
> Prashant Sharma
>
>
> On Thu, Mar 13, 2014 at 1:32 PM, prabeesh k  wrote:
>
> > Hi All,
> >
> > We can import packages in Scala as blocked imports (several names grouped in braces)
> > or unblocked imports (one name per line).
> >
> > I think blocked imports are better, since they help reduce LOC.
> >
> > But the Spark code uses a mix of both, so it would be better to settle on one.
> >
> > Please post your thoughts on the Scala style for imports.
> >
> > Regards,
> > prabeesh
> >
>


[DISCUSS] Scala Style for import

2014-03-13 Thread prabeesh k
Hi All,

We can import packages in Scala as blocked imports (several names grouped in braces) or
unblocked imports (one name per line).

I think blocked imports are better, since they help reduce LOC.

But the Spark code uses a mix of both, so it would be better to settle on one.

Please post your thoughts on the Scala style for imports.

Regards,
prabeesh