Error - Spark reading from HDFS via dataframes - Java

2017-09-30 Thread Kanagha Kumar
Hi,

I'm trying to read data from HDFS in Spark as DataFrames. Printing the
schema, I see all columns are being read as strings. I'm converting the result
to an RDD and creating another DataFrame by passing in the correct schema (how
the rows should ultimately be interpreted).

I'm getting the following error:

Caused by: java.lang.RuntimeException: java.lang.String is not a valid
external type for schema of bigint



Spark read API:

Dataset<Row> hdfs_dataset = new SQLContext(spark).read()
    .option("header", "false")
    .csv("hdfs:/inputpath/*");

Dataset<Row> ds = new SQLContext(spark)
    .createDataFrame(hdfs_dataset.toJavaRDD(), conversionSchema);
This is the schema to be converted to:
StructType(StructField(COL1,StringType,true),
StructField(COL2,StringType,true),
StructField(COL3,LongType,true),
StructField(COL4,StringType,true),
StructField(COL5,StringType,true),
StructField(COL6,LongType,true))

This is the original schema obtained once the read API was invoked:
StructType(StructField(_c1,StringType,true),
StructField(_c2,StringType,true),
StructField(_c3,StringType,true),
StructField(_c4,StringType,true),
StructField(_c5,StringType,true),
StructField(_c6,StringType,true))

My interpretation is that even when the JavaRDD is converted to a DataFrame by
passing in the new schema, the values are not actually cast to the new types.
This happens because the read API above reads all the data from HDFS as
strings.

How can I convert an RDD to a DataFrame by passing in the correct schema once
the data has been read?
How can the values be cast correctly during this RDD-to-DataFrame conversion?

Or how can I read data from HDFS with an input schema in Java?
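For reference, this is roughly what I have in mind, as a sketch only (it
assumes spark here is the SparkSession, the CSV reader can take the target
schema up front, and conversionSchema / hdfs_dataset are the same as above):

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;

// Option 1 (sketch): hand the target schema to the CSV reader so the values
// are parsed into longs/strings at read time instead of all-strings.
Dataset<Row> typed = spark.read()
    .schema(conversionSchema)
    .option("header", "false")
    .csv("hdfs:/inputpath/*");

// Option 2 (sketch): keep the all-string read and cast the numeric columns
// explicitly; createDataFrame(rdd, schema) only relabels the schema, it does
// not convert the underlying values.
Dataset<Row> casted = hdfs_dataset
    .withColumn("_c3", col("_c3").cast(DataTypes.LongType))
    .withColumn("_c6", col("_c6").cast(DataTypes.LongType));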
Any suggestions are helpful. Thanks!


Re: HDP 2.5 - Python - Spark-On-Hbase

2017-09-30 Thread Debabrata Ghosh
Ayan,
Did you get the HBase connection working through PySpark as well? I have got
the Spark-HBase connection working with Scala (via HBaseContext). However, I
eventually want to get this working within PySpark code. Would you have some
suitable code snippets or an approach so that I can call a Scala class from
within PySpark?

Thanks,
Debu

On Wed, Jun 28, 2017 at 3:18 PM, ayan guha  wrote:

> Hi
>
> Thanks to all of you, I could get the HBase connector working. There are
> still some details around namespaces pending, but overall it is working
> well.
>
> Now, as usual, I would like to apply the same concept to Structured
> Streaming. Is there any similar way I can use writeStream.format with an
> HBase writer? Or any other way to write continuous data to HBase?
>
> best
> Ayan
>
> On Tue, Jun 27, 2017 at 2:15 AM, Weiqing Yang 
> wrote:
>
>> For SHC documentation, please refer to the README in the SHC GitHub repo,
>> which is kept up to date.
>>
>> On Mon, Jun 26, 2017 at 5:46 AM, ayan guha  wrote:
>>
>>> Thanks all, I have found the correct version of the package. Probably the
>>> HDP documentation is a little behind.
>>>
>>> Best
>>> Ayan
>>>
>>> On Mon, 26 Jun 2017 at 2:16 pm, Mahesh Sawaiker <
>>> mahesh_sawai...@persistent.com> wrote:
>>>
 Ayan,

The location of the Logging class changed between Spark 1.6 and Spark 2.0 (it
was removed from the public API).

Looks like you are trying to run 1.6 code on 2.0. I have ported some code like
this before; if you have access to the code, you can recompile it by replacing
references to the Logging class with the slf4j Logger class directly. Most of
the code tends to be easily portable; see the sketch below.
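As a rough illustration of what that replacement looks like (shown here in
Java; the class name is just an example):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MyHBaseJob {

  // Replaces the log methods previously mixed in from org.apache.spark.Logging.
  private static final Logger LOG = LoggerFactory.getLogger(MyHBaseJob.class);

  public void run() {
    LOG.info("Starting HBase job");
    // ... existing job logic unchanged ...
  }
}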



 Following is the release note for Spark 2.0



 *Removals, Behavior Changes and Deprecations*

 *Removals*

 The following features have been removed in Spark 2.0:

- Bagel
- Support for Hadoop 2.1 and earlier
- The ability to configure closure serializer
- HTTPBroadcast
- TTL-based metadata cleaning
- *Semi-private class org.apache.spark.Logging. We suggest you use
slf4j directly.*
- SparkContext.metricsSystem

 Thanks,

 Mahesh





 *From:* ayan guha [mailto:guha.a...@gmail.com]
 *Sent:* Monday, June 26, 2017 6:26 AM
 *To:* Weiqing Yang
 *Cc:* user
 *Subject:* Re: HDP 2.5 - Python - Spark-On-Hbase



 Hi



 I am using following:



 --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories
 http://repo.hortonworks.com/content/groups/public/



Is it compatible with Spark 2.x? I would like to use it.



 Best

 Ayan



 On Sat, Jun 24, 2017 at 2:09 AM, Weiqing Yang 
 wrote:

 Yes.

What SHC version were you using?

If you hit any issues, you can post them in the SHC GitHub issues; there are
some existing threads about this.



 On Fri, Jun 23, 2017 at 5:46 AM, ayan guha  wrote:

 Hi



 Is it possible to use SHC from Hortonworks with pyspark? If so, any
 working code sample available?



 Also, I faced an issue while running the samples with Spark 2.0



 "Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging"



 Any workaround?



 Thanks in advance



 --

 Best Regards,
 Ayan Guha







 --

 Best Regards,
 Ayan Guha

>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: NullPointerException error while saving Scala Dataframe to HBase

2017-09-30 Thread mailfordebu
Hi guys - I am not sure whether this email is reaching the community members.
Can somebody please acknowledge?

Sent from my iPhone

> On 30-Sep-2017, at 5:02 PM, Debabrata Ghosh  wrote:
> 
> Dear All,
> Greetings! I am repeatedly hitting a NullPointerException error while
> saving a Scala DataFrame to HBase. Can you please help me resolve this?
> Here is the code snippet:
> 
> scala> def catalog = s"""{
>  ||"table":{"namespace":"default", "name":"table1"},
>  ||"rowkey":"key",
>  ||"columns":{
>  |  |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
>  |  |"col1":{"cf":"cf1", "col":"col1", "type":"string"}
>  ||}
>  |  |}""".stripMargin
> catalog: String
> 
> scala> case class HBaseRecord(
>  |col0: String,
>  |col1: String)
> defined class HBaseRecord
> 
> scala> val data = (0 to 255).map { i => HBaseRecord(i.toString, "extra") }
> data: scala.collection.immutable.IndexedSeq[HBaseRecord] =
> Vector(HBaseRecord(0,extra), HBaseRecord(1,extra), HBaseRecord(2,extra),
> HBaseRecord(3,extra), HBaseRecord(4,extra), HBaseRecord(5,extra),
> HBaseRecord(6,extra), HBaseRecord(7,extra), HBaseRecord(8,extra),
> HBaseRecord(9,extra), HBaseRecord(10,extra), HBaseRecord(11,extra),
> HBaseRecord(12,extra), HBaseRecord(13,extra), HBaseRecord(14,extra),
> HBaseRecord(15,extra), HBaseRecord(16,extra), HBaseRecord(17,extra),
> HBaseRecord(18,extra), HBaseRecord(19,extra), HBaseRecord(20,extra),
> HBaseRecord(21,extra), HBaseRecord(22,extra), HBaseRecord(23,extra),
> HBaseRecord(24,extra), HBaseRecord(25,extra), HBaseRecord(26,extra),
> HBaseRecord(27,extra), HBaseRecord(28,extra), HBaseRecord(29,extra),
> HBaseRecord(30,extra), HBaseRecord(31,extra), HBase...
> 
> scala> import org.apache.spark.sql.datasources.hbase
> import org.apache.spark.sql.datasources.hbase
>  
> 
> scala> import org.apache.spark.sql.datasources.hbase.{HBaseTableCatalog}
> import org.apache.spark.sql.datasources.hbase.HBaseTableCatalog
> 
> scala> sc.parallelize(data).toDF.write.options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5")).format("org.apache.hadoop.hbase.spark").save()
> 
> java.lang.NullPointerException
>   at org.apache.hadoop.hbase.spark.HBaseRelation.<init>(DefaultSource.scala:134)
>   at org.apache.hadoop.hbase.spark.DefaultSource.createRelation(DefaultSource.scala:75)
>   at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
>   ... 56 elided
> 
> 
> Thanks in advance !
> 
> Debu
> 




Re: Structured Streaming and Hive

2017-09-30 Thread Jacek Laskowski
Hi,

My guess is that it's a timing issue. Once you started the query, batch 0
either had no rows to save or had not started yet (it runs in a separate
thread), so spark.sql ran once and saved nothing.

You should instead use a ForeachWriter to save the results to Hive, along the
lines of the sketch below.
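A minimal sketch of that approach in Java, assuming the results go into a Hive
table over a HiveServer2 JDBC connection (the URL, table name and column types
below are illustrative, not taken from your job):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;

public class HiveForeachWriter extends ForeachWriter<Row> {
  private transient Connection conn;
  private transient PreparedStatement stmt;

  @Override
  public boolean open(long partitionId, long version) {
    try {
      // Hypothetical HiveServer2 endpoint and target table.
      conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default");
      stmt = conn.prepareStatement("INSERT INTO results VALUES (?, ?)");
      return true;                // accept this partition/epoch
    } catch (Exception e) {
      return false;               // skip if the connection cannot be opened
    }
  }

  @Override
  public void process(Row row) {
    try {
      stmt.setString(1, row.getString(0));
      stmt.setLong(2, row.getLong(1));
      stmt.executeUpdate();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void close(Throwable errorOrNull) {
    try {
      if (conn != null) conn.close();
    } catch (Exception ignored) {
      // best-effort cleanup
    }
  }
}

You would then wire it into the streaming query with something like
result.writeStream().foreach(new HiveForeachWriter()).outputMode("append").start().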

Jacek

On 29 Sep 2017 11:36 am, "HanPan"  wrote:

> Hi guys,
>
>
>
>  I'm new to Spark Structured Streaming. I'm using 2.1.0 and my
> scenario is reading a specific topic from Kafka, doing some data mining
> tasks, and then saving the result dataset to Hive.
>
>  Writing data to Hive somehow seems not to be supported yet, and I
> tried this:
>
>It runs OK, but there is no result in Hive.
>
>
>
>Any idea writing the stream result to hive?
>
>
>
> Thanks
>
> Pan
>
>
>
>
>


NullPointerException error while saving Scala Dataframe to HBase

2017-09-30 Thread Debabrata Ghosh
Dear All,
Greetings! I am repeatedly hitting a NullPointerException error while saving a
Scala DataFrame to HBase. Can you please help me resolve this? Here is the
code snippet:

scala> def catalog = s"""{
 ||"table":{"namespace":"default", "name":"table1"},
 ||"rowkey":"key",
 ||"columns":{
 |  |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
 |  |"col1":{"cf":"cf1", "col":"col1", "type":"string"}
 ||}
 |  |}""".stripMargin
catalog: String

scala> case class HBaseRecord(
 |col0: String,
 |col1: String)
defined class HBaseRecord

scala> val data = (0 to 255).map { i => HBaseRecord(i.toString, "extra") }
data: scala.collection.immutable.IndexedSeq[HBaseRecord] =
Vector(HBaseRecord(0,extra), HBaseRecord(1,extra), HBaseRecord(2,extra),
HBaseRecord(3,extra), HBaseRecord(4,extra), HBaseRecord(5,extra),
HBaseRecord(6,extra), HBaseRecord(7,extra), HBaseRecord(8,extra),
HBaseRecord(9,extra), HBaseRecord(10,extra), HBaseRecord(11,extra),
HBaseRecord(12,extra), HBaseRecord(13,extra), HBaseRecord(14,extra),
HBaseRecord(15,extra), HBaseRecord(16,extra), HBaseRecord(17,extra),
HBaseRecord(18,extra), HBaseRecord(19,extra), HBaseRecord(20,extra),
HBaseRecord(21,extra), HBaseRecord(22,extra), HBaseRecord(23,extra),
HBaseRecord(24,extra), HBaseRecord(25,extra), HBaseRecord(26,extra),
HBaseRecord(27,extra), HBaseRecord(28,extra), HBaseRecord(29,extra),
HBaseRecord(30,extra), HBaseRecord(31,extra), HBase...

scala> import org.apache.spark.sql.datasources.hbase
import org.apache.spark.sql.datasources.hbase


scala> import org.apache.spark.sql.datasources.hbase.{HBaseTableCatalog}
import org.apache.spark.sql.datasources.hbase.HBaseTableCatalog

scala> sc.parallelize(data).toDF.write.options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5")).format("org.apache.hadoop.hbase.spark").save()

java.lang.NullPointerException
  at org.apache.hadoop.hbase.spark.HBaseRelation.<init>(DefaultSource.scala:134)
  at org.apache.hadoop.hbase.spark.DefaultSource.createRelation(DefaultSource.scala:75)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
  ... 56 elided
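One suggestion I have come across for this exact trace, which I have not yet
verified, is that the hbase-spark DefaultSource expects an HBaseContext to
have been created before the write so that it has an HBase configuration to
pick up. A rough sketch of that setup (shown in Java; names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.spark.JavaHBaseContext;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class HBaseContextSetup {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hbase-write").getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    // Assumption: creating the HBase context up front gives the connector an
    // HBase configuration (from hbase-site.xml on the classpath) to work with.
    Configuration hbaseConf = HBaseConfiguration.create();
    new JavaHBaseContext(jsc, hbaseConf);

    // ...then perform the same DataFrame write with
    // .format("org.apache.hadoop.hbase.spark") as in the shell session above.
  }
}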


Thanks in advance !

Debu


Re: HDFS or NFS as a cache?

2017-09-30 Thread Steve Loughran

On 29 Sep 2017, at 20:03, JG Perrin <jper...@lumeris.com> wrote:

You will collect in the driver (often the master) and it will save the data, so 
for saving, you will not have to set up HDFS.

no, it doesn't work quite like that.

1. workers generate their data and save it somewhere
2. on "task commit" they move their data to some location where it will be
visible for "job commit" (rename, upload, whatever)
3. job commit, which is done in the driver, takes all the committed task data
and makes it visible in the destination directory.
4. Then they create a _SUCCESS file to say "done!"


This is done with Spark coordinating between the workers and the driver to
guarantee that only one task working on a specific part of the data commits its
work, and that the job is committed only once all tasks have finished.

The v1 mapreduce committer implements (2) by moving files under a job attempt
dir, and (3) by moving them from the job attempt dir to the destination: one
rename per task commit, another rename of every file on job commit. In HDFS,
Azure wasb and other stores with an O(1) atomic rename, this isn't *too*
expensive, though that final job commit rename still takes time to list and
move lots of files.

The v2 committer implements (2) by renaming to the destination directory and
(3) as a no-op. Renames happen in the tasks then, but not that second,
serialized one at the end.
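If you want to experiment with the v2 algorithm against a store with a real
rename (HDFS, wasb), here is a minimal sketch of selecting it from a Spark
job; the property is the standard MapReduce setting passed through Spark's
hadoop-conf prefix, and the app name and path are illustrative:

import org.apache.spark.sql.SparkSession;

public class V2CommitterExample {
  public static void main(String[] args) {
    // Sketch: request the v2 FileOutputCommitter algorithm so task commits
    // rename straight into the destination and job commit is close to a no-op.
    SparkSession spark = SparkSession.builder()
        .appName("committer-v2-example")
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .getOrCreate();

    // Illustrative write; the committer choice applies to file outputs like parquet.
    spark.range(1000).write().parquet("hdfs:///tmp/committer-v2-demo");
  }
}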

There's no copy of data from workers to the driver; instead you need a shared
output filesystem so that the job committer can do its work alongside the tasks.

There are alternative committer algorithms:

1. look at Ryan Blue's talk: https://www.youtube.com/watch?v=BgHrff5yAQo
2. IBM Stocator paper (https://arxiv.org/abs/1709.01812) and code 
(https://github.com/SparkTC/stocator/)
3. Ongoing work in Hadoop itself for better committers. Goal: year end & Hadoop 
3.1 https://issues.apache.org/jira/browse/HADOOP-13786 . The code is all there, 
Parquet is a trouble spot, and more testing is welcome from anyone who wants to 
help.
4. Databricks have "something"; specifics aren't covered, but I assume it's 
DynamoDB based.


-Steve






From: Alexander Czech [mailto:alexander.cz...@googlemail.com]
Sent: Friday, September 29, 2017 8:15 AM
To: user@spark.apache.org
Subject: HDFS or NFS as a cache?

I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write parquet 
files to S3. But the S3 performance for various reasons is bad when I access s3 
through the parquet write method:

df.write.parquet('s3a://bucket/parquet')
Now I want to setup a small cache for the parquet output. One output is about 
12-15 GB in size. Would it be enough to setup a NFS-directory on the master, 
write the output to it and then move it to S3? Or should I setup a HDFS on the 
Master? Or should I even opt for an additional cluster running a HDFS solution 
on more than one node?
thanks!



Re: HDFS or NFS as a cache?

2017-09-30 Thread Steve Loughran

On 29 Sep 2017, at 15:59, Alexander Czech <alexander.cz...@googlemail.com> wrote:

Yes, I have identified the rename as the problem; that is why I think the extra
bandwidth of the larger instances might not help. Also, there is a consistency
issue with S3 because of how the rename works, so I would probably lose data.

correct

rename is mimicked with a COPY + DELETE; copy is in S3 and your bandwidth 
appears to be 6-10 MB/s


On Fri, Sep 29, 2017 at 4:42 PM, Vadim Semenov <vadim.seme...@datadoghq.com> wrote:
How many files do you produce? I believe it spends a lot of time on renaming
the files because of the output committer.
Also, instead of 5x c3.2xlarge, try using 2x c3.8xlarge, because they have 10GbE
and you can get good throughput for S3.

On Fri, Sep 29, 2017 at 9:15 AM, Alexander Czech <alexander.cz...@googlemail.com> wrote:
I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write parquet 
files to S3. But the S3 performance for various reasons is bad when I access s3 
through the parquet write method:

df.write.parquet('s3a://bucket/parquet')

Now I want to setup a small cache for the parquet output. One output is about 
12-15 GB in size. Would it be enough to setup a NFS-directory on the master, 
write the output to it and then move it to S3? Or should I setup a HDFS on the 
Master? Or should I even opt for an additional cluster running a HDFS solution 
on more than one node?

thanks!





py4j.protocol.Py4JNetworkError: Error while receiving Socket.timeout: timed out

2017-09-30 Thread Krishnaprasad
Hi all,

I am developing an application that runs on Apache Spark (set up on a single
node) and, as part of the implementation, I am using PySpark version 2.2.0.

Environment - OS is Ubuntu 14.04 and Python version is 3.4.

I am getting the error shown below. It would be helpful if somebody could
suggest a quick resolution or workaround for this problem:

Traceback (most recent call last):
  File "/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1028, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/lib/python3.4/socket.py", line 374, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 883, in send_command
    response = connection.send_command(command)
  File "/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1040, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving

Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.4/multiprocessing/process.py", line 254, in _bootstrap
    self.run()
  File "/usr/lib/python3.4/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/spark-2.2.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 327, in get_return_value
    format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o180.fit

Thanks,
Krishnaprasad



