Re: Parquet file generated by Spark, but not compatible read by Hive

2017-06-12 Thread ayan guha
Try setting the following parameter:

conf.set("spark.sql.hive.convertMetastoreParquet","false")

On Tue, Jun 13, 2017 at 3:34 PM, Angel Francisco Orta <
angel.francisco.o...@gmail.com> wrote:

> Hello,
>
> Do you use df.write, or do you do it with hivecontext.sql("insert into ...")?
>
> Angel.
>
> On Jun 12, 2017 at 11:07 PM, "Yong Zhang" wrote:
>
>> We are using Spark *1.6.2* as ETL to generate a Parquet file for one
>> dataset, partitioned by "brand" (a string column representing the brand
>> in this dataset).
>>
>>
>> After the partition folders (such as "brand=a") are generated in HDFS, we
>> add the partitions in Hive.
>>
>>
>> The Hive version is *1.2.1* (in fact, we are using HDP 2.5.0).
>>
>>
>> Now the problem is that for 2 brand partitions we cannot query the data
>> generated in Spark, but it works fine for the rest of the partitions.
>>
>>
>> Below is the error from the Hive CLI and hive.log that I get if I query a bad
>> partition, e.g. "select * from tablename where brand='*BrandA*' limit
>> 3;"
>>
>>
>> Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
>> java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable
>>
>>
>> Caused by: java.lang.UnsupportedOperationException: Cannot inspect
>> org.apache.hadoop.io.LongWritable
>> at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveWritableObject(ParquetStringInspector.java:52)
>> at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:222)
>> at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:307)
>> at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:262)
>> at org.apache.hadoop.hive.serde2.DelimitedJSONSerDe.serializeField(DelimitedJSONSerDe.java:72)
>> at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:246)
>> at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50)
>> at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:71)
>> at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:40)
>> at org.apache.hadoop.hive.ql.exec.ListSinkOperator.process(ListSinkOperator.java:90)
>> ... 22 more
>>
>> There is not much I can find by googling this error message, but it
>> points to the schema in Hive being different from the one in the Parquet file.
>> But this is a very strange case, as the same schema works fine for the other
>> brands, which are just values of the partition column and share the same
>> Hive schema as above.
>>
>> If I query "select * from tablename where brand='*BrandB*' limit
>> 3;", everything works fine.
>>
>> So is this really caused by a mismatch between the Hive schema and the Parquet
>> file generated by Spark, by the data within the different partition keys, or
>> is it really a compatibility issue between Spark and Hive?
>>
>> Thanks
>>
>> Yong
>>
>>
>>


-- 
Best Regards,
Ayan Guha


Re: Parquet file generated by Spark, but not compatible read by Hive

2017-06-12 Thread Angel Francisco Orta
Hello,

Do you use df.write, or do you do it with hivecontext.sql("insert into ...")?

Angel.

On Jun 12, 2017 at 11:07 PM, "Yong Zhang" wrote:

> We are using Spark *1.6.2* as ETL to generate a Parquet file for one
> dataset, partitioned by "brand" (a string column representing the brand
> in this dataset).
>
>
> After the partition folders (such as "brand=a") are generated in HDFS, we
> add the partitions in Hive.
>
>
> The Hive version is *1.2.1* (in fact, we are using HDP 2.5.0).
>
>
> Now the problem is that for 2 brand partitions we cannot query the data
> generated in Spark, but it works fine for the rest of the partitions.
>
>
> Below is the error from the Hive CLI and hive.log that I get if I query a bad
> partition, e.g. "select * from tablename where brand='*BrandA*' limit 3;"
>
>
> Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.UnsupportedOperationException: Cannot inspect
> org.apache.hadoop.io.LongWritable
>
>
> Caused by: java.lang.UnsupportedOperationException: Cannot inspect
> org.apache.hadoop.io.LongWritable
> at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveWritableObject(ParquetStringInspector.java:52)
> at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:222)
> at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:307)
> at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:262)
> at org.apache.hadoop.hive.serde2.DelimitedJSONSerDe.serializeField(DelimitedJSONSerDe.java:72)
> at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:246)
> at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50)
> at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:71)
> at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:40)
> at org.apache.hadoop.hive.ql.exec.ListSinkOperator.process(ListSinkOperator.java:90)
> ... 22 more
>
> There is not much I can find by googling this error message, but it
> points to the schema in Hive being different from the one in the Parquet file.
> But this is a very strange case, as the same schema works fine for the other
> brands, which are just values of the partition column and share the same
> Hive schema as above.
>
> If I query "select * from tablename where brand='*BrandB*' limit
> 3;", everything works fine.
>
> So is this really caused by a mismatch between the Hive schema and the Parquet
> file generated by Spark, by the data within the different partition keys, or
> is it really a compatibility issue between Spark and Hive?
>
> Thanks
>
> Yong
>
>
>


Re: Use SQL Script to Write Spark SQL Jobs

2017-06-12 Thread Benjamin Kim
Hi Bo,

+1 for your project. I come from the world of data warehouses, ETL, and 
reporting analytics. There are many individuals who do not know how to code or 
do not want to do any coding. They are content with ANSI SQL and stick to it. ETL 
workflows are also built without any coding, using drag-and-drop user interfaces 
such as Talend, SSIS, etc. There is a small amount of scripting involved, but not 
too much. I looked at what you are trying to do, and I welcome it. This could open 
up Spark to the masses and shorten development times.

Cheers,
Ben


> On Jun 12, 2017, at 10:14 PM, bo yang  wrote:
> 
> Hi Aakash,
> 
> Thanks for your willingness to help :) It will be great if I could get more 
> feedback on my project. For example, are there other people who feel the 
> need for a script to write Spark jobs easily? Also, I would like to explore 
> whether it is possible for the Spark project to take on some work to build such a 
> script-based high-level DSL.
> 
> Best,
> Bo
> 
> 
> On Mon, Jun 12, 2017 at 12:14 PM, Aakash Basu  > wrote:
> Hey,
> 
> I work on Spark SQL and would pretty much be able to help you in this. Let me 
> know your requirement.
> 
> Thanks,
> Aakash.
> 
> On 12-Jun-2017 11:00 AM, "bo yang"  > wrote:
> Hi Guys,
> 
> I am writing a small open source project 
> to use SQL scripts to write Spark 
> jobs, and want to see if there are other people interested in using or contributing 
> to this project.
> 
> The project is called UberScriptQuery 
> (https://github.com/uber/uberscriptquery 
> ). Sorry for the clumsy name; I chose it to avoid 
> conflicts with many other names (Spark is a registered trademark, thus I could 
> not use "Spark" in my project name).
> 
> In short, it is a high-level SQL-like DSL (Domain Specific Language) on top 
> of Spark. People can use that DSL to write Spark jobs without worrying about 
> Spark internal details. Please check the README 
> in the project to get more details.
> 
> It will be great if I could get any feedback or suggestions!
> 
> Best,
> Bo
> 
> 



Re: Use SQL Script to Write Spark SQL Jobs

2017-06-12 Thread bo yang
Hi Aakash,

Thanks for your willingness to help :) It will be great if I could get more
feedback on my project. For example, are there other people who feel the
need for a script to write Spark jobs easily? Also, I would like to explore
whether it is possible for the Spark project to take on some work to build such
a script-based high-level DSL.

Best,
Bo


On Mon, Jun 12, 2017 at 12:14 PM, Aakash Basu 
wrote:

> Hey,
>
> I work on Spark SQL and would pretty much be able to help you in this. Let
> me know your requirement.
>
> Thanks,
> Aakash.
>
> On 12-Jun-2017 11:00 AM, "bo yang"  wrote:
>
>> Hi Guys,
>>
>> I am writing a small open source project
>> to use SQL scripts to write
>> Spark jobs, and want to see if there are other people interested in using or
>> contributing to this project.
>>
>> The project is called UberScriptQuery
>> (https://github.com/uber/uberscriptquery). Sorry for the clumsy name; I chose
>> it to avoid conflicts with many other names (Spark is a registered trademark,
>> thus I could not use "Spark" in my project name).
>>
>> In short, it is a high-level SQL-like DSL (Domain Specific Language) on
>> top of Spark. People can use that DSL to write Spark jobs without worrying
>> about Spark internal details. Please check the README
>> in the project to get more
>> details.
>>
>> It will be great if I could get any feedback or suggestions!
>>
>> Best,
>> Bo
>>
>>


Re: Deciphering spark warning "Truncated the string representation of a plan since it was too large."

2017-06-12 Thread lucas.g...@gmail.com
AFAIK the process a Spark program follows is:

   1. A set of transformations is defined on a given input dataset.
   2. At some point an action is called.
      1. In your case this is writing to your Parquet file.
   3. When that happens, Spark creates a logical plan and then a physical
      plan (this is largely where your transformations are optimized) to
      perform the transformations specified.
      1. This is similar to what a SQL engine does: it takes your raw SQL
         and turns it into something it can execute to get the data you
         requested.
      2. A set of artifacts is generated; one of those artifacts is the
         plan whose string representation you see being truncated.

The only time I'd be concerned about this would be if I was debugging the
code and needed to see what was being truncated; it is, after all, a debug
setting ('spark.debug.maxToStringFields').
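A minimal sketch of raising the limit, assuming Spark 2.x and that the setting is picked up from the application's SparkConf (which is what "SparkEnv.conf" in the message refers to); the value 200 is illustrative:

import org.apache.spark.sql.SparkSession

// Allow longer plan strings before truncation; only useful when you
// actually need to read the full plan in the logs.
val spark = SparkSession.builder()
  .appName("plan-debug-example")
  .config("spark.debug.maxToStringFields", "200")
  .getOrCreate()

// The same can be passed on the command line:
//   spark-submit --conf spark.debug.maxToStringFields=200 ...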

Good luck!

Gary

On 12 June 2017 at 15:10, Henry M  wrote:

>
>
> I am trying to understand if I should be concerned about this warning:
>
> "WARN  Utils:66 - Truncated the string representation of a plan since it
> was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields'
> in SparkEnv.conf"
>
> It occurs while writing a data frame to parquet.
>
> Has anyone on this list looked into this warning before, and could you help
> explain what it means?
>
> Thank you for your help,
> Henry
>


Deciphering spark warning "Truncated the string representation of a plan since it was too large."

2017-06-12 Thread Henry M
I am trying to understand if I should be concerned about this warning:

"WARN  Utils:66 - Truncated the string representation of a plan since it
was too large. This behavior can be adjusted by setting
'spark.debug.maxToStringFields' in SparkEnv.conf"

It occurs while writing a data frame to parquet.

Has anyone on this list looked into this warning before, and could you help
explain what it means?

Thank you for your help,
Henry


broadcast() multiple times the same df. Is it cached ?

2017-06-12 Thread matd
Hi spark folks,

In our application, we have to join a dataframe with several other dataframes
(not always on the same join column).

This left-hand-side df is not very large, so a broadcast hint may be
beneficial.

My questions:
- If the same df gets broadcast multiple times, will the transfer occur once
(i.e., is the broadcast data somehow cached on the executors), or multiple times?
- If the joins concern different columns, will it be cached as well?
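For reference, a minimal sketch of the broadcast hint being asked about, assuming the Spark 2.x DataFrame API (the dataframes and column names are illustrative, and this does not by itself answer whether the transfer is reused across joins):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-hint-example").getOrCreate()
import spark.implicits._

// Illustrative data: one small dimension-like df joined to two larger ones
// on different columns.
val smallDf = Seq((1, "x"), (2, "y")).toDF("key_a", "label").withColumn("key_b", $"key_a")
val factsA  = Seq((1, 10.0), (2, 20.0)).toDF("key_a", "value")
val factsB  = Seq((1, 30.0), (2, 40.0)).toDF("key_b", "value")

// Hint Spark to broadcast the small df in each join; each join may still
// plan its own broadcast exchange.
val joinedA = factsA.join(broadcast(smallDf), Seq("key_a"))
val joinedB = factsB.join(broadcast(smallDf), Seq("key_b"))
joinedA.show(); joinedB.show()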

Thanks for your insights
Mathieu







Parquet file generated by Spark, but not compatible read by Hive

2017-06-12 Thread Yong Zhang
We are using Spark 1.6.2 as ETL to generate a Parquet file for one dataset,
partitioned by "brand" (a string column representing the brand in this dataset).


After the partition folders (such as "brand=a") are generated in HDFS, we add the
partitions in Hive.


The Hive version is 1.2.1 (in fact, we are using HDP 2.5.0).


Now the problem is that for 2 brand partitions we cannot query the data
generated in Spark, but it works fine for the rest of the partitions.


Below is the error from the Hive CLI and hive.log that I get if I query a bad
partition, e.g. "select * from tablename where brand='BrandA' limit 3;"


Failed with exception 
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.UnsupportedOperationException: Cannot inspect 
org.apache.hadoop.io.LongWritable


Caused by: java.lang.UnsupportedOperationException: Cannot inspect 
org.apache.hadoop.io.LongWritable
at 
org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveWritableObject(ParquetStringInspector.java:52)
at 
org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:222)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:307)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:262)
at 
org.apache.hadoop.hive.serde2.DelimitedJSONSerDe.serializeField(DelimitedJSONSerDe.java:72)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:246)
at 
org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50)
at 
org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:71)
at 
org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:40)
at 
org.apache.hadoop.hive.ql.exec.ListSinkOperator.process(ListSinkOperator.java:90)
... 22 more

There is not much I can find by googling this error message, but it points
to the schema in Hive being different from the one in the Parquet file.
But this is a very strange case, as the same schema works fine for the other
brands, which are just values of the partition column and share the same Hive
schema as above.

If I query "select * from tablename where brand='BrandB' limit 3;",
everything works fine.

So is this really caused by a mismatch between the Hive schema and the Parquet
file generated by Spark, by the data within the different partition keys, or is
it really a compatibility issue between Spark and Hive?
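One way to narrow this down is to compare the schema Spark actually wrote for a bad partition with the schema Hive expects. This is a sketch, not a confirmed diagnosis: the "Cannot inspect LongWritable" from ParquetStringInspector suggests Hive expects a string where the file holds a 64-bit integer, but that is an inference, and the HDFS path below is illustrative.

// In spark-shell (Spark 1.6: sqlContext.read; 2.x: spark.read):
val suspect = sqlContext.read.parquet("hdfs:///warehouse/tablename/brand=BrandA")
suspect.printSchema()   // note the exact types Spark wrote, column by column

// In the Hive CLI / beeline, compare with what Hive expects:
//   DESCRIBE FORMATTED tablename;
//   DESCRIBE FORMATTED tablename PARTITION (brand='BrandA');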

Thanks

Yong




Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-12 Thread Thakrar, Jayesh
Could this be due to https://issues.apache.org/jira/browse/HIVE-6 ?

From: Patrik Medvedev 
Date: Monday, June 12, 2017 at 2:31 AM
To: Jörn Franke , vaquar khan 
Cc: Jean Georges Perrin , User 
Subject: Re: [Spark JDBC] Does spark support read from remote Hive server via 
JDBC

Hello,

All security checks are disabled, but I still don't get any data in the result.


On Sun, Jun 11, 2017 at 14:24, Jörn Franke wrote:
Is sentry preventing the access?

On 11. Jun 2017, at 01:55, vaquar khan 
> wrote:
Hi ,
Please check your firewall security settings. Sharing one good link:

http://belablotski.blogspot.in/2016/01/access-hive-tables-from-spark-using.html?m=1



Regards,
Vaquar khan

On Jun 8, 2017 1:53 AM, "Patrik Medvedev" 
> wrote:
Hello guys,

Can somebody help me with my problem?
Let me know, if you need more details.


On Wed, Jun 7, 2017 at 16:43, Patrik Medvedev wrote:
No, I don't.

On Wed, Jun 7, 2017 at 16:42, Jean Georges Perrin wrote:
Do you have some other security in place like Kerberos or impersonation? It may 
affect your access.


jg


On Jun 7, 2017, at 02:15, Patrik Medvedev 
> wrote:
Hello guys,

I need to execute Hive queries on a remote Hive server from Spark, but for some
reason I receive only column names (without data).
The data is available in the table; I checked it via HUE and a Java JDBC connection.

Here is my code example:
val test = spark.read
.option("url", "jdbc:hive2://remote.hive.server:1/work_base")
.option("user", "user")
.option("password", "password")
.option("dbtable", "some_table_with_data")
.option("driver", "org.apache.hive.jdbc.HiveDriver")
.format("jdbc")
.load()
test.show()


Scala version: 2.11
Spark version: 2.1.0, i also tried 2.1.1
Hive version: CDH 5.7 Hive 1.1.1
Hive JDBC version: 1.1.1

This problem occurs with later Hive versions, too.
Could you help me with this issue? I didn't find anything in the mailing list
archives or on StackOverflow.
Or could you help me find the correct way to query a remote Hive from Spark?

--
Cheers,
Patrick
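A sketch of one workaround that is often suggested for this "column names instead of data" symptom (an assumption here, not a confirmed fix for this thread): Spark's default JDBC dialect wraps column names in double quotes, which HiveServer2 treats as string literals, so registering a dialect that quotes with backticks may help.

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect for jdbc:hive2 URLs; register it before spark.read.
object HiveJdbcDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  // Quote identifiers with backticks instead of double quotes.
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

JdbcDialects.registerDialect(HiveJdbcDialect)
// ...then run the spark.read.format("jdbc") code from the message above.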


Re: Use SQL Script to Write Spark SQL Jobs

2017-06-12 Thread Aakash Basu
Hey,

I work on Spark SQL and would pretty much be able to help you in this. Let
me know your requirement.

Thanks,
Aakash.

On 12-Jun-2017 11:00 AM, "bo yang"  wrote:

> Hi Guys,
>
> I am writing a small open source project
> to use SQL scripts to write
> Spark jobs, and want to see if there are other people interested in using or
> contributing to this project.
>
> The project is called UberScriptQuery (https://github.com/uber/
> uberscriptquery). Sorry for the clumsy name; I chose it to avoid conflicts with
> many other names (Spark is a registered trademark, thus I could not use "Spark"
> in my project name).
>
> In short, it is a high-level SQL-like DSL (Domain Specific Language) on
> top of Spark. People can use that DSL to write Spark jobs without worrying
> about Spark internal details. Please check the README
> in the project to get more
> details.
>
> It will be great if I could get any feedback or suggestions!
>
> Best,
> Bo
>
>


RE: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-12 Thread Mohammed Guller
Regarding Spark scheduler – if you are referring to the ability to distribute 
workload and scale, Kafka Streaming also provides that capability. It is 
deceptively simple in that regard if you already have a Kafka cluster. You can 
launch multiple instances of your Kafka streaming application and Kafka 
streaming will automatically balance the workload across different instances. 
It rebalances workload as you add or remove instances. Similarly, if an 
instance fails or crashes, it will automatically detect that.

Regarding real-time – rather than debating which one is real-time, I would look 
at the latency requirements of my application. For most applications, the near 
real time capabilities of Spark Streaming might be good enough. For others, it 
may not.  For example, if I was building a high-frequency trading application, 
where I want to process individual trades as soon as they happen, I might lean 
towards Kafka streaming.

Agree about the benefits of using SQL with structured streaming.

Mohammed

From: kant kodali [mailto:kanth...@gmail.com]
Sent: Sunday, June 11, 2017 3:41 PM
To: Mohammed Guller 
Cc: vincent gromakowski ; yohann jardin 
; vaquar khan ; user 

Subject: Re: What is the real difference between Kafka streaming and Spark 
Streaming?

Also, another difference I see is something like Spark SQL, where there are 
logical plans, physical plans, code generation, and all those optimizations; I 
don't see them in Kafka Streaming at this time.

On Sun, Jun 11, 2017 at 2:19 PM, kant kodali 
> wrote:
I appreciate the responses; however, I see the other side of the argument, and I 
actually feel they are now competitors in the streaming space, in some sense.

Kafka Streaming can indeed do map, reduce, join, and window operations, and likewise 
data can be ingested into Kafka from many sources and the results sent out to many 
sinks (look up "Kafka Connect").

Regarding event-at-a-time vs. micro-batch: I hear arguments from one group of 
people saying Spark Streaming is real time, and from another group saying Kafka 
Streaming is the true real time. So do we call micro-batch real time, or 
event-at-a-time real time?

It is a well-known fact that Spark is more popular with data scientists who want 
to run ML algorithms and so on, but I also hear that people can use the H2O package 
along with Kafka Streaming. How efficient each of these approaches is, I have 
no clue.

The major difference I see is actually the Spark scheduler. I don't think Kafka 
Streaming has anything like this; instead, it just allows you to run lambda 
expressions on a stream and write the results out to a specific topic/partition, 
and from there one can use Kafka Connect to write them out to any sink. So, in 
short, all the optimizations built into the Spark scheduler don't seem to exist 
in Kafka Streaming, so if I were to decide which framework to use, an additional 
question I would think about is: "Do I want my stream to go through the 
scheduler, and if so, why or why not?"

Above all, please correct me if I am wrong :)




On Sun, Jun 11, 2017 at 12:41 PM, Mohammed Guller 
> wrote:
Just to elaborate more on what Vincent wrote – Kafka Streaming provides true 
record-at-a-time processing capabilities, whereas Spark Streaming provides 
micro-batching capabilities on top of Spark. Depending on your use case, you 
may find one better than the other. Both provide stateless and stateful stream 
processing capabilities.

A few more things to consider:

  1.  If you don’t already have a Spark cluster but have a Kafka cluster, it may 
be easier to use Kafka Streaming, since you don’t need to set up and manage 
another cluster.
  2.  On the other hand, if you already have a Spark cluster but don’t have a 
Kafka cluster (in case you are using some other messaging system), Spark 
Streaming is the better option.
  3.  If you already know and use Spark, you may find it easier to program with 
the Spark Streaming API even if you are using Kafka.
  4.  Spark Streaming may give you better throughput. So you have to decide 
what is more important for your stream processing application – latency or 
throughput?
  5.  Kafka Streaming is relatively new and less mature than Spark Streaming.

Mohammed

From: vincent gromakowski 
[mailto:vincent.gromakow...@gmail.com]
Sent: Sunday, June 11, 2017 12:09 PM
To: yohann jardin >
Cc: kant kodali >; vaquar khan 
>; user 
>
Subject: Re: What is the real difference between Kafka streaming and Spark 
Streaming?

I think Kafka streams is good when the processing of each 

Re: [E] Re: Spark Job is stuck at SUBMITTED when set Driver Memory > Executor Memory

2017-06-12 Thread Rastogi, Pankaj
Please make sure that you have enough memory available on the driver node. If 
there is not enough free memory on the driver node, then your application won't 
start.
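For reference, a sketch of requesting driver memory explicitly at submit time (the class name, master URL, deploy mode, and sizes are illustrative; the point is simply that the host the driver lands on must have at least the requested amount free):

./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --driver-memory 8G \
  --executor-memory 4G \
  /path/to/app.jar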

Pankaj

From: vaquar khan >
Date: Saturday, June 10, 2017 at 5:02 PM
To: Abdulfattah Safa >
Cc: User >
Subject: [E] Re: Spark Job is stuck at SUBMITTED when set Driver Memory > 
Executor Memory

You can add memory in your command; make sure the requested memory is available 
on your executors.


./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000


https://spark.apache.org/docs/1.1.0/submitting-applications.html

Also try to avoid functions that need a lot of memory on the driver, like collect(), etc.


Regards,
Vaquar khan


On Jun 4, 2017 5:46 AM, "Abdulfattah Safa" 
> wrote:
I'm working on Spark in standalone cluster mode. I need to increase the driver 
memory, as I got an OOM in the driver thread. I found that when setting the 
driver memory to more than the executor memory, the submitted job is stuck at 
SUBMITTED and the application never starts.



Re: [How-To] Custom file format as source

2017-06-12 Thread Vadim Semenov
It should be easy to start with a custom Hadoop InputFormat that reads the
file and creates an `RDD[Row]`. Since you know the record size, it should
be pretty easy to make the InputFormat produce splits, so you could then
read the file in parallel.
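A related minimal sketch using SparkContext.binaryRecords, which already handles fixed-length records. The path and byte order are assumptions, the 28-byte layout is taken from the example columns quoted below (8 + 4 + 8 + 8), and this assumes the header/catalog section is stored or stripped separately, since binaryRecords expects the whole file to be fixed-length records:

import java.nio.{ByteBuffer, ByteOrder}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("fixed-record-reader").getOrCreate()

val recordLength = 8 + 4 + 8 + 8  // Col1 int64, Col2 string index int32, Col3 int64, Col4 float64
val schema = StructType(Seq(
  StructField("col1", LongType),
  StructField("col2_index", IntegerType),   // resolve against the catalog separately
  StructField("col3", LongType),
  StructField("col4", DoubleType)))

// Each element is one raw record of exactly recordLength bytes.
val rows = spark.sparkContext
  .binaryRecords("hdfs:///data/custom_format/file.bin", recordLength)
  .map { bytes =>
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN) // byte order is an assumption
    Row(buf.getLong, buf.getInt, buf.getLong, buf.getDouble)
  }

val df = spark.createDataFrame(rows, schema)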

On Mon, Jun 12, 2017 at 6:01 AM, OBones  wrote:

> Hello,
>
> I have an application here that generates data files in a custom binary
> format that provides the following information:
>
> Column list, each column has a data type (64 bit integer, 32 bit string
> index, 64 bit IEEE float, 1 byte boolean)
> Catalogs that give modalities for some columns (ie, column 1 contains only
> the following values: A, B, C, D)
> Array for actual data, each row has a fixed size according to the columns.
>
> Here is an example:
>
> Col1, 64bit integer
> Col2, 32bit string index
> Col3, 64bit integer
> Col4, 64bit float
>
> Catalog for Col1 = 10, 20, 30, 40, 50
> Catalog for Col2 = Big, Small, Large, Tall
> Catalog for Col3 = 101, 102, 103, 500, 5000
> Catalog for Col4 = (no catalog)
>
> Data array =
> 8 bytes, 4 bytes, 8 bytes, 8 bytes,
> 8 bytes, 4 bytes, 8 bytes, 8 bytes,
> 8 bytes, 4 bytes, 8 bytes, 8 bytes,
> 8 bytes, 4 bytes, 8 bytes, 8 bytes,
> 8 bytes, 4 bytes, 8 bytes, 8 bytes,
> ...
>
> I would like to use this kind of file as a source for various ML related
> computations (CART, RandomForrest, Gradient boosting...) and Spark is very
> interesting in this area.
> However, I'm a bit lost as to what I should write to have Spark use that
> file format as a source for its computation. Considering that those files
> are quite big (100 million lines, hundreds of gigs on disk), I'd rather not
> create something that writes a new file in a built-in format, but I'd
> rather write some code that makes Spark accept the file as it is.
>
> I looked around and saw the textfile method but it is not applicable to my
> case. I also saw the spark.read.format("libsvm") syntax which tells me that
> there is a list of supported formats known to spark, which I believe are
> called Dataframes, but I could not find any tutorial on this subject.
>
> Would you have any suggestion or links to documentation that would get me
> started?
>
> Regards,
> Olivier
>
>
>


Re: SPARK environment settings issue when deploying a custom distribution

2017-06-12 Thread Chanh Le
Just to add more information on how I built the custom distribution:
I cloned the Spark repo, switched to branch-2.2, and then made the
distribution as follows.

λ ~/workspace/big_data/spark/ branch-2.2*
λ ~/workspace/big_data/spark/ ./dev/make-distribution.sh --name custom
--tgz -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive -Phive-thriftserver
-Pmesos -Pyarn



On Mon, Jun 12, 2017 at 6:14 PM Chanh Le  wrote:

> Hi everyone,
>
> Recently I discovered an issue when processing CSV in Spark, so I decided
> to fix it following https://issues.apache.org/jira/browse/SPARK-21024 and
> built a custom distribution for internal use. I built it on my local
> machine and then uploaded the distribution to the server.
>
> server's *~/.bashrc*
>
> # added by Anaconda2 4.3.1 installer
> export PATH="/opt/etl/anaconda/anaconda2/bin:$PATH"
> export SPARK_HOME="/opt/etl/spark-2.1.0-bin-hadoop2.7"
> export
> PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
>
> What I did on server was:
> export SPARK_HOME=/home/etladmin/spark-2.2.1-SNAPSHOT-bin-custom
>
> $SPARK_HOME/bin/spark-submit --version
> It prints out version *2.1.1*, which *is not* the version I built (2.2.1).
>
>
> I did set *SPARK_HOME* on my local machine (macOS) for this distribution
> and it works well, printing out version *2.2.1*.
>
> I need a way to find out which environment variable is silently taking effect.
>
> Do you have any suggestions?
> Thanks in advance.
>
> Regards,
> Chanh
>
> --
> Regards,
> Chanh
>
-- 
Regards,
Chanh
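A few checks that might narrow down which spark-submit and which SPARK_HOME are actually in effect (a sketch; the paths are the ones already mentioned in this thread):

# Run these in the same shell session that launches spark-submit:
echo $SPARK_HOME
command -v spark-submit        # is a spark-submit found on PATH shadowing $SPARK_HOME/bin?
/home/etladmin/spark-2.2.1-SNAPSHOT-bin-custom/bin/spark-submit --version
grep -n SPARK_HOME ~/.bashrc "$SPARK_HOME/conf/spark-env.sh" 2>/dev/null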


SPARK environment settings issue when deploying a custom distribution

2017-06-12 Thread Chanh Le
Hi everyone,

Recently I discovered an issue when processing CSV in Spark, so I decided
to fix it following https://issues.apache.org/jira/browse/SPARK-21024 and
built a custom distribution for internal use. I built it on my local
machine and then uploaded the distribution to the server.

server's *~/.bashrc*

# added by Anaconda2 4.3.1 installer
export PATH="/opt/etl/anaconda/anaconda2/bin:$PATH"
export SPARK_HOME="/opt/etl/spark-2.1.0-bin-hadoop2.7"
export
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH

What I did on server was:
export SPARK_HOME=/home/etladmin/spark-2.2.1-SNAPSHOT-bin-custom

$SPARK_HOME/bin/spark-submit --version
It prints out version *2.1.1*, which *is not* the version I built (2.2.1).


I did set *SPARK_HOME* on my local machine (macOS) for this distribution
and it works well, printing out version *2.2.1*.

I need a way to find out which environment variable is silently taking effect.

Do you have any suggestions?
Thanks in advance.

Regards,
Chanh

-- 
Regards,
Chanh


RE: [How-To] Custom file format as source

2017-06-12 Thread Mendelson, Assaf
Try 
https://mapr.com/blog/spark-data-source-api-extending-our-spark-sql-query-engine/
 

Thanks,
  Assaf.
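Along those lines, a minimal sketch of the Data Source API approach (the package, class, and column names are illustrative; the fixed 28-byte record layout is taken from the example in the original message, and the string-index/catalog lookup is left out):

package com.example.binformat  // hypothetical package name

import java.nio.{ByteBuffer, ByteOrder}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types._

// Spark looks up a class named DefaultSource in the package given to .format(...).
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new FixedRecordRelation(parameters("path"))(sqlContext)
}

class FixedRecordRelation(path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan with Serializable {

  private val recordLength = 8 + 4 + 8 + 8  // int64, int32 string index, int64, float64

  override def schema: StructType = StructType(Seq(
    StructField("col1", LongType),
    StructField("col2_index", IntegerType),
    StructField("col3", LongType),
    StructField("col4", DoubleType)))

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.binaryRecords(path, recordLength).map { bytes =>
      val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN) // byte order assumed
      Row(buf.getLong, buf.getInt, buf.getLong, buf.getDouble)
    }
}

// Usage: spark.read.format("com.example.binformat").load("/path/to/data.bin")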

-Original Message-
From: OBones [mailto:obo...@free.fr] 
Sent: Monday, June 12, 2017 1:01 PM
To: user@spark.apache.org
Subject: [How-To] Custom file format as source

Hello,

I have an application here that generates data files in a custom binary format 
that provides the following information:

Column list, each column has a data type (64 bit integer, 32 bit string index, 
64 bit IEEE float, 1 byte boolean) Catalogs that give modalities for some 
columns (ie, column 1 contains only the following values: A, B, C, D) Array for 
actual data, each row has a fixed size according to the columns.

Here is an example:

Col1, 64bit integer
Col2, 32bit string index
Col3, 64bit integer
Col4, 64bit float

Catalog for Col1 = 10, 20, 30, 40, 50
Catalog for Col2 = Big, Small, Large, Tall Catalog for Col3 = 101, 102, 103, 
500, 5000 Catalog for Col4 = (no catalog)

Data array =
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
...

I would like to use this kind of file as a source for various ML related 
computations (CART, RandomForrest, Gradient boosting...) and Spark is very 
interesting in this area.
However, I'm a bit lost as to what I should write to have Spark use that file 
format as a source for its computation. Considering that those files are quite 
big (100 million lines, hundreds of gigs on disk), I'd rather not create 
something that writes a new file in a built-in format, but I'd rather write 
some code that makes Spark accept the file as it is.

I looked around and saw the textfile method but it is not applicable to my 
case. I also saw the spark.read.format("libsvm") syntax which tells me that 
there is a list of supported formats known to spark, which I believe are called 
Dataframes, but I could not find any tutorial on this subject.

Would you have any suggestion or links to documentation that would get me 
started?

Regards,
Olivier




[How-To] Custom file format as source

2017-06-12 Thread OBones

Hello,

I have an application here that generates data files in a custom binary 
format that provides the following information:


Column list, each column has a data type (64 bit integer, 32 bit string 
index, 64 bit IEEE float, 1 byte boolean)
Catalogs that give modalities for some columns (ie, column 1 contains 
only the following values: A, B, C, D)

Array for actual data, each row has a fixed size according to the columns.

Here is an example:

Col1, 64bit integer
Col2, 32bit string index
Col3, 64bit integer
Col4, 64bit float

Catalog for Col1 = 10, 20, 30, 40, 50
Catalog for Col2 = Big, Small, Large, Tall
Catalog for Col3 = 101, 102, 103, 500, 5000
Catalog for Col4 = (no catalog)

Data array =
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
...

I would like to use this kind of file as a source for various ML related 
computations (CART, RandomForrest, Gradient boosting...) and Spark is 
very interesting in this area.
However, I'm a bit lost as to what I should write to have Spark use that 
file format as a source for its computation. Considering that those 
files are quite big (100 million lines, hundreds of gigs on disk), I'd 
rather not create something that writes a new file in a built-in format, 
but I'd rather write some code that makes Spark accept the file as it is.


I looked around and saw the textfile method but it is not applicable to 
my case. I also saw the spark.read.format("libsvm") syntax which tells 
me that there is a list of supported formats known to spark, which I 
believe are called Dataframes, but I could not find any tutorial on this 
subject.


Would you have any suggestion or links to documentation that would get 
me started?


Regards,
Olivier




Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-12 Thread Patrik Medvedev
Hello,

All security checks are disabled, but I still don't get any data in the result.


On Sun, Jun 11, 2017 at 14:24, Jörn Franke wrote:

> Is sentry preventing the access?
>
> On 11. Jun 2017, at 01:55, vaquar khan  wrote:
>
> Hi ,
> Please check your firewall security settings. Sharing one good link:
>
>
> http://belablotski.blogspot.in/2016/01/access-hive-tables-from-spark-using.html?m=1
>
>
>
> Regards,
> Vaquar khan
>
> On Jun 8, 2017 1:53 AM, "Patrik Medvedev" 
> wrote:
>
>> Hello guys,
>>
>> Can somebody help me with my problem?
>> Let me know, if you need more details.
>>
>>
>> On Wed, Jun 7, 2017 at 16:43, Patrik Medvedev wrote:
>>
>>> No, I don't.
>>>
>>> On Wed, Jun 7, 2017 at 16:42, Jean Georges Perrin wrote:
>>>
 Do you have some other security in place like Kerberos or
 impersonation? It may affect your access.


 jg


 On Jun 7, 2017, at 02:15, Patrik Medvedev 
 wrote:

 Hello guys,

 I need to execute hive queries on remote hive server from spark, but
 for some reasons i receive only column names(without data).
 Data available in table, i checked it via HUE and java jdbc connection.

 Here is my code example:
 val test = spark.read
 .option("url",
 "jdbc:hive2://remote.hive.server:1/work_base")
 .option("user", "user")
 .option("password", "password")
 .option("dbtable", "some_table_with_data")
 .option("driver", "org.apache.hive.jdbc.HiveDriver")
 .format("jdbc")
 .load()
 test.show()


 Scala version: 2.11
 Spark version: 2.1.0, i also tried 2.1.1
 Hive version: CDH 5.7 Hive 1.1.1
 Hive JDBC version: 1.1.1

 This problem occurs with later Hive versions, too.
 Could you help me with this issue? I didn't find anything in the mailing
 list archives or on StackOverflow.
 Or could you help me find the correct way to query a remote Hive
 from Spark?

 --
 *Cheers,*
 *Patrick*